5420 Anomaly Detection | Assignment 7 - Joyce Ng (jn2901)¶
The purpose of conducting anomaly analysis on healthcare/inpatient data is to identify and understand unusual patterns or outliers that could indicate potential issues or opportunities for improvement in healthcare delivery, cost management, and patient outcomes. Objective of this project is to perform exhaustive EDA on inpatient data and provide business insights of each feature. Then utilize Autoencoder and iForest to detect anomalies.¶
Desciption of dataset:
DRG Definition:
Diagnosis Related Group. Classification system that groups similar clinical conditions (diagnoses) and the procedures furnished by the hospital during the stay.
Total Discharges:
The number of discharges billed by all providers for inpatient hospital services.
Average Covered Charges:
= Total Covered Charge Amount / Total Discharges. The average charge of all provider's services covered by Medicare for discharges in the DRG. These will vary from hospital to hospital because of differences in hospital charge structures.
Average Total Payment:
= Total Payments / Total Discharges. The average total payments of what Medicare actually pays to the provider as well as co-payment and deductible amounts that the beneficiary is responsible for and payments by third parties for coordination of benefits.
Average Medicare Payment:
= Medicare Payment Amount / Total Discharges. The average amount that Medicare pays to the provider. Medicare payment amounts include the MS-DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Medicare payments DO NOT include beneficiary co-payments and deductible amounts nor any additional payments from third parties for coordination of benefits.
1.1: Load Libraries and Dataset ¶
# Load libraries
import pandas as pd
import numpy as np
from functools import reduce
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import roc_curve,roc_auc_score, confusion_matrix, f1_score, accuracy_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from sklearn import metrics
from pyod.models.knn import KNN
from pyod.models.auto_encoder import AutoEncoder
from pyod.models.iforest import IForest
# Visiualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv('/Users/Joyce630/Desktop/Columbia/5420 Anomaly Detection/Assignments/4 - Healthcare Fraud/Data/inpatientCharges.csv')
data.head()
| DRG Definition | Provider Id | Provider Name | Provider Street Address | Provider City | Provider State | Provider Zip Code | Hospital Referral Region Description | Total Discharges | Average Covered Charges | Average Total Payments | Average Medicare Payments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | $32963.07 | $5777.24 | $4763.73 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | $15131.85 | $5787.57 | $4976.71 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | $37560.37 | $5434.95 | $4453.79 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | $13998.28 | $5417.56 | $4129.16 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | $31633.27 | $5658.33 | $4851.44 |
print(data.shape, "\n") # Check the shape of df
print(data.columns, "\n") # Check column names
print(data.info(), "\n") # Check info of the df
(163065, 12)
Index(['DRG Definition', 'Provider Id', 'Provider Name',
'Provider Street Address', 'Provider City', 'Provider State',
'Provider Zip Code', 'Hospital Referral Region Description',
' Total Discharges ', ' Average Covered Charges ',
' Average Total Payments ', 'Average Medicare Payments'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163065 entries, 0 to 163064
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DRG Definition 163065 non-null object
1 Provider Id 163065 non-null int64
2 Provider Name 163065 non-null object
3 Provider Street Address 163065 non-null object
4 Provider City 163065 non-null object
5 Provider State 163065 non-null object
6 Provider Zip Code 163065 non-null int64
7 Hospital Referral Region Description 163065 non-null object
8 Total Discharges 163065 non-null int64
9 Average Covered Charges 163065 non-null object
10 Average Total Payments 163065 non-null object
11 Average Medicare Payments 163065 non-null object
dtypes: int64(3), object(9)
memory usage: 14.9+ MB
None
1.2: Check Missing Values & Change Columns Names and Data Type ¶
# Check for missing values
missing_values = data.isnull().sum()
missing_values
DRG Definition 0 Provider Id 0 Provider Name 0 Provider Street Address 0 Provider City 0 Provider State 0 Provider Zip Code 0 Hospital Referral Region Description 0 Total Discharges 0 Average Covered Charges 0 Average Total Payments 0 Average Medicare Payments 0 dtype: int64
# remove white space in
data = data.rename(columns=lambda x: x.strip())
# Change data type
data = data.rename(columns=lambda x: x.strip())
data['DRG Definition'] = data['DRG Definition'].astype('category')
data['Provider State'] = data['Provider State'].astype('category')
data['Provider City'] = data['Provider City'].astype('category')
data['Provider Id'] = data['Provider Id'].astype('category')
# Remove $ sign from columns
data['Average Covered Charges'] = data['Average Covered Charges'].str.strip("$").astype((float))
data['Average Total Payments'] = data['Average Total Payments'].str.strip("$").astype((float))
data['Average Medicare Payments'] = data['Average Medicare Payments'].str.strip("$").astype((float))
# Remove white space in columns
data.columns = ['DRG','Provider_Id', 'Provider_Name','Provider_StreetAddress','Provider_City',
'Provider_State','Provider_Zipcode','Hospital_referral_region_desp',
'Total_Discharges','Average_Covered_Charges','Average_Total_Payments',
'Average_Medicare_Payments']
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 163065 entries, 0 to 163064 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 DRG 163065 non-null category 1 Provider_Id 163065 non-null category 2 Provider_Name 163065 non-null object 3 Provider_StreetAddress 163065 non-null object 4 Provider_City 163065 non-null category 5 Provider_State 163065 non-null category 6 Provider_Zipcode 163065 non-null int64 7 Hospital_referral_region_desp 163065 non-null object 8 Total_Discharges 163065 non-null int64 9 Average_Covered_Charges 163065 non-null float64 10 Average_Total_Payments 163065 non-null float64 11 Average_Medicare_Payments 163065 non-null float64 dtypes: category(4), float64(3), int64(2), object(3) memory usage: 11.1+ MB
data.head()
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | 32963.07 | 5777.24 | 4763.73 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | 15131.85 | 5787.57 | 4976.71 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | 37560.37 | 5434.95 | 4453.79 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | 13998.28 | 5417.56 | 4129.16 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | 31633.27 | 5658.33 | 4851.44 |
unique_id = len(data.groupby('Provider_Id').count())
unique_providers = len(data.groupby('Provider_Name').count())
unique_states = len(data.groupby('Provider_State').count())
unique_cities = len(data.groupby('Provider_City').count())
unique_drg = len(data.groupby('DRG').count())
print(f'Unique ID = {unique_id}\n')
print(f'Unique Providers = {unique_providers}\n')
print(f'Unique States = {unique_states}\n')
print(f'Unique Cities = {unique_cities}\n')
print(f'Unique DRG = {unique_drg}\n')
Unique ID = 3337 Unique Providers = 3201 Unique States = 51 Unique Cities = 1977 Unique DRG = 100
data.describe()
| Provider_Zipcode | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payments | |
|---|---|---|---|---|---|
| count | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 | 163065.000000 |
| mean | 47938.121908 | 42.776304 | 36133.954224 | 9707.473804 | 8494.490964 |
| std | 27854.323080 | 51.104042 | 35065.365931 | 7664.642598 | 7309.467261 |
| min | 1040.000000 | 11.000000 | 2459.400000 | 2673.000000 | 1148.900000 |
| 25% | 27261.000000 | 17.000000 | 15947.160000 | 5234.500000 | 4192.350000 |
| 50% | 44309.000000 | 27.000000 | 25245.820000 | 7214.100000 | 6158.460000 |
| 75% | 72901.000000 | 49.000000 | 43232.590000 | 11286.400000 | 10056.880000 |
| max | 99835.000000 | 3383.000000 | 929118.900000 | 156158.180000 | 154620.810000 |
2.1 Distribution of Avg. Payments ¶
# Create a figure
fig = go.Figure()
# Add histogram for Average Total Payments
fig.add_trace(go.Histogram(x=data['Average_Total_Payments'], name='Average_Total_Payments', opacity=1))
fig.add_trace(go.Histogram(x=data['Average_Medicare_Payments'], name='Average_Medicare_Payments', opacity=0.85))
# Update layout
fig.update_layout(
title_text='Distribution of Average Total & Medicare Payments',
xaxis_title_text='Payments ($)',
yaxis_title_text='Frequency',
height=800, width=950,
template = 'seaborn',
plot_bgcolor = '#F6F5F5'
)
# Show plot
fig.show()
Based on the graph, it's notable that the average Medicare payment exceeds the overall average payment, which is unusual and potentially concerning. Therefore, further investigation is necessary to identify and understand these discrepancies.
2.1.2 Distributions of Charges and Payments by States ¶
sns.pairplot(data[['Provider_State','Average_Total_Payments','Total_Discharges','Average_Medicare_Payments','Average_Covered_Charges']],
hue= 'Provider_State')
<seaborn.axisgrid.PairGrid at 0x287eebe10>
The above graph show the distributions of different charges and payments on each states. There are already some outliers showed on the graph which need further investigations.
2.1.3 DRG Frequency ¶
Identify the most common DRGs by grouping the DRG and total number of discharges, then sort in an descending order.
common_drg = data.groupby('DRG')
common_drg.agg({'Total_Discharges': np.sum}).sort_values(by=['Total_Discharges'], ascending=0)[:20]
| Total_Discharges | |
|---|---|
| DRG | |
| 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC | 427207 |
| 871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC | 319072 |
| 392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC | 244854 |
| 292 - HEART FAILURE & SHOCK W CC | 222038 |
| 690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC | 206695 |
| 194 - SIMPLE PNEUMONIA & PLEURISY W CC | 198390 |
| 291 - HEART FAILURE & SHOCK W MCC | 185599 |
| 641 - MISC DISORDERS OF NUTRITION,METABOLISM,FLUIDS/ELECTROLYTES W/O MCC | 153660 |
| 683 - RENAL FAILURE W CC | 150444 |
| 190 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W MCC | 149677 |
| 191 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W CC | 148491 |
| 312 - SYNCOPE & COLLAPSE | 141918 |
| 603 - CELLULITIS W/O MCC | 140894 |
| 378 - G.I. HEMORRHAGE W CC | 138678 |
| 313 - CHEST PAIN | 131079 |
| 193 - SIMPLE PNEUMONIA & PLEURISY W MCC | 127989 |
| 287 - CIRCULATORY DISORDERS EXCEPT AMI, W CARD CATH W/O MCC | 115920 |
| 192 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W/O CC/MCC | 114790 |
| 310 - CARDIAC ARRHYTHMIA & CONDUCTION DISORDERS W/O CC/MCC | 113454 |
| 872 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W/O MCC | 112430 |
drg470 = data[data['DRG'].str.contains('470')]
The most common procedures finsihed by the hospital was 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC, In the following I created a scatter plot to take a deeper look at the average medicare payment of this group.
# Create scatter plot using Plotly Express with color mapped to 'Average_Covered_Charges'
fig = px.scatter(drg470, x='Provider_State', y='Average_Medicare_Payments',
color='Average_Medicare_Payments',
color_continuous_scale='PuBuGn',
title='Average Medicare Payments for DRG 470 by State',
labels={'Provider_State': 'State', 'Average_Medicare_Payments': 'Average Medicare Payments'})
# Customize layout
fig.update_layout(
xaxis_title='State',
yaxis_title='Average Medicare Payments',
template='plotly_white'
)
# Show the figure
fig.show()
For DRG 470, the average medicare payments seems to be within 10k to 20k, with some states i.e. Maryland, California, New York over 25k or even above 30k.
Next, I'll take a look at how much each procedures had cost by each state.
def stat(data,group_var,cont_var):
coln =list(data[[group_var,cont_var]].describe().index)
row = data[[group_var]].drop_duplicates().reset_index(drop=True).values.tolist()
row = [item for sublist in row for item in sublist]
desc_cont_var = pd.DataFrame(data[[group_var,cont_var]].groupby(group_var).describe().values.tolist(), columns=coln,index=row)
return desc_cont_var
stat(data,'DRG','Average_Total_Payments').head(n=10)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 1079.0 | 6960.534004 | 1477.873952 | 4968.00 | 6001.8300 | 6582.890 | 7516.8250 | 18420.56 |
| 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC | 1201.0 | 6706.276445 | 2033.965862 | 4194.09 | 5412.8700 | 6093.750 | 7345.3600 | 25519.43 |
| 069 - TRANSIENT ISCHEMIA | 1659.0 | 13263.823032 | 3847.918207 | 8174.28 | 10762.2200 | 12084.700 | 14424.3250 | 50882.40 |
| 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC | 2269.0 | 7922.671141 | 2084.658336 | 5368.73 | 6626.2700 | 7280.050 | 8503.0600 | 26510.15 |
| 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC | 1806.0 | 5713.985221 | 1342.538675 | 3916.41 | 4819.3250 | 5326.025 | 6197.4800 | 14744.05 |
| 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC | 1962.0 | 5068.644292 | 1114.407934 | 3514.41 | 4320.7375 | 4740.235 | 5459.7375 | 12312.80 |
| 074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O MCC | 979.0 | 6386.793177 | 1597.733831 | 4344.17 | 5278.4200 | 5903.250 | 7060.7800 | 15759.14 |
| 101 - SEIZURES W/O MCC | 1593.0 | 5493.361883 | 1420.448742 | 3704.89 | 4530.4100 | 5058.160 | 6013.8600 | 15554.42 |
| 149 - DYSEQUILIBRIUM | 988.0 | 4589.104342 | 1082.833515 | 3185.77 | 3826.4325 | 4285.315 | 5036.1925 | 9325.21 |
| 176 - PULMONARY EMBOLISM W/O MCC | 1396.0 | 7279.954979 | 1544.880403 | 4891.30 | 6238.9800 | 6848.800 | 7933.6125 | 15783.05 |
stat(data,'DRG','Average_Medicare_Payments').head(n=10)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 1079.0 | 5555.837525 | 1236.063439 | 3592.85 | 4759.8000 | 5269.280 | 6033.6100 | 15855.18 |
| 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC | 1201.0 | 5701.676570 | 1973.620083 | 3116.42 | 4467.6600 | 5104.780 | 6293.6000 | 22873.49 |
| 069 - TRANSIENT ISCHEMIA | 1659.0 | 12112.849445 | 3702.047762 | 6603.63 | 9765.6300 | 10956.160 | 13094.6300 | 48632.28 |
| 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC | 2269.0 | 6781.501785 | 1979.482762 | 4109.25 | 5555.9100 | 6167.510 | 7342.0300 | 23402.26 |
| 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC | 1806.0 | 4469.203560 | 1271.852222 | 2771.88 | 3653.9750 | 4079.575 | 4862.4650 | 13710.23 |
| 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC | 1962.0 | 3952.275102 | 1094.682773 | 2348.07 | 3240.5050 | 3627.465 | 4298.7650 | 9959.58 |
| 074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O MCC | 979.0 | 5238.579438 | 1542.402076 | 2456.83 | 4201.5650 | 4815.660 | 5862.4700 | 13310.28 |
| 101 - SEIZURES W/O MCC | 1593.0 | 4452.600709 | 1379.483972 | 2629.69 | 3535.5800 | 4012.660 | 4966.9500 | 13961.18 |
| 149 - DYSEQUILIBRIUM | 988.0 | 3440.677065 | 1076.033108 | 2012.35 | 2717.5875 | 3145.555 | 3805.2575 | 8615.31 |
| 176 - PULMONARY EMBOLISM W/O MCC | 1396.0 | 6077.948660 | 1505.418988 | 3561.68 | 5122.6200 | 5644.930 | 6554.8175 | 15161.27 |
2.1.4 Choropleth Map ¶
def plot_map(data, group_var, cont_var, stat='mean', color='RdBu_r', label=''):
# Grouping the Data
if stat == 'count':
x = data.groupby(group_var).size() # Count number of rows per group
else:
x = data.groupby(group_var)[cont_var].agg(stat) # Compute specified statistic
# Plotting Heat Map
fig = go.Figure(go.Choropleth(
locations=list(x.index),
z=x.astype(float),
locationmode='USA-states',
colorscale=color,
colorbar_title=f'{stat.capitalize()} {label.lower()}',
))
fig.update_layout(
title_text=f'{cont_var.replace("_", " ").title()} Per State',
geo_scope='usa',
height=800, width=950
)
return fig
# Plot the distribution of variables by states using choropleth map
p = [
plot_map(data, "Provider_State", "Total_Discharges", stat='count'),
plot_map(data, "Provider_State", "Average_Covered_Charges", stat='mean', label="Covered Charges"),
plot_map(data, "Provider_State", "Average_Total_Payments", stat='mean', label="Total Payments"),
plot_map(data, "Provider_State", "Average_Medicare_Payments", stat='mean', label="Medicare Payments")
]
for fig in p:
fig.show()
Section 3: Feature Engineering ¶
3.1 Avgerage Medicare Payments ¶
For this section, benchmarks are calculated by grouping the DRGs, provider states, cities and zip code from the dataset, then take the mean of the Average medicare payments.
Four features are created under this section which are Average Medicare Payments by DRG, States, Cities & Zip Code. They can provides insights into healthcare payment patterns across various levels of granularity.
3.1.1 By DRG ¶
table_1 = data.groupby(['DRG'])['Average_Medicare_Payments'].mean().reset_index()
table_1 = table_1.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Payments_byDRG'})
table_1.head(10)
| DRG | Average_Medicare_Payments_byDRG | |
|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 5555.837525 |
| 1 | 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... | 5701.676570 |
| 2 | 064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 12112.849445 |
| 3 | 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 6781.501785 |
| 4 | 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFA... | 4469.203560 |
| 5 | 069 - TRANSIENT ISCHEMIA | 3952.275102 |
| 6 | 074 - CRANIAL & PERIPHERAL NERVE DISORDERS W/O... | 5238.579438 |
| 7 | 101 - SEIZURES W/O MCC | 4452.600709 |
| 8 | 149 - DYSEQUILIBRIUM | 3440.677065 |
| 9 | 176 - PULMONARY EMBOLISM W/O MCC | 6077.948660 |
To better identify outliers, additional quantiles i.e 90%, 95% are added by using the following function:¶
def calculate_descriptive_stats(data, column):
'''
This function calculate descriptive statistics with additional quantile.
'''
# Calculate basic descriptive statistics
desc_stats = data[column].describe()
# Calculate the 90th and 95th percentiles
perc_90 = data[column].quantile(0.90)
perc_95 = data[column].quantile(0.95)
# Convert the descriptive statistics to a DataFrame
desc_stats = desc_stats.to_frame().transpose()
# Add the 90th and 95th percentiles to the DataFrame
desc_stats['90%'] = perc_90
desc_stats['95%'] = perc_95
# Reorder the columns to match the desired format
desc_stats = desc_stats[['mean', 'min', '25%', '50%', '75%', '90%', '95%', 'max']]
# Transpose the DataFrame
desc_stats = desc_stats.T
return desc_stats
table_1_stats = calculate_descriptive_stats(table_1, 'Average_Medicare_Payments_byDRG')
table_1_stats
| Average_Medicare_Payments_byDRG | |
|---|---|
| mean | 9248.575795 |
| min | 2876.015593 |
| 25% | 4465.052848 |
| 50% | 6391.828157 |
| 75% | 11444.865846 |
| 90% | 17047.411069 |
| 95% | 21501.661996 |
| max | 41899.432929 |
Based on the above table, payments that exceeded the 95% threshold should be considered as anomalies.
# Add average medicare payments by states to the dataset
df1 = pd.merge(data, table_1, how='left', on=['DRG'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by DRG
df1['Average_Medicare_byDRG_Ratio'] = np.where(df1['Average_Medicare_Payments_byDRG']==0,0, df1['Average_Medicare_Payments'] / df1['Average_Medicare_Payments_byDRG'])
Distribution of Average Medicare Payments By DRG Ratio¶
# Define variables and quantile range
var = 'Average_Medicare_byDRG_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(
title='Distribution of Average Medicare Payments By DRG Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
This feature focuses on understanding how Medicare payments vary across different DRGs. It helps in identifying anomalies such as DRGs with unusually high or low Medicare payments compared to the overall average.
3.1.2 Per DRG by State ¶
table_2 = data.groupby(['DRG', 'Provider_State'])['Average_Medicare_Payments'].mean().reset_index()
table_2.columns = ['DRG', 'Provider_State', 'Average_Medicare_Bystates']
table_2.head(10)
| DRG | Provider_State | Average_Medicare_Bystates | |
|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AK | 6413.780000 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AL | 4599.593043 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AR | 4938.712500 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AZ | 5912.832917 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CA | 6878.954925 |
| 5 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CO | 5419.930000 |
| 6 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CT | 6709.101333 |
| 7 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | DC | 8110.816667 |
| 8 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | DE | 5719.316667 |
| 9 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | FL | 4868.594568 |
table_2_stats = calculate_descriptive_stats(table_2, 'Average_Medicare_Bystates')
table_2_stats
| Average_Medicare_Bystates | |
|---|---|
| mean | 9289.670980 |
| min | 2216.372500 |
| 25% | 4483.274762 |
| 50% | 6635.284161 |
| 75% | 11044.008750 |
| 90% | 18103.081589 |
| 95% | 24208.021478 |
| max | 63646.045000 |
# Add average medicare payments by states to the dataset
df2 = pd.merge(data, table_2, how='left', on=['DRG', 'Provider_State'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df2['Average_Medicare_Bystates_Ratio'] = np.where(df2['Average_Medicare_Bystates']==0,0, df2['Average_Medicare_Payments'] / df2['Average_Medicare_Bystates'])
df2['Average_Medicare_Bystates_Ratio'].describe()
count 163065.000000 mean 1.000000 std 0.215091 min 0.339831 25% 0.868395 50% 0.954250 75% 1.071483 max 5.174415 Name: Average_Medicare_Bystates_Ratio, dtype: float64
Distribution of Average Medicare Payments Per DRG By State Ratio¶
# Define variables and quantile range
var = 'Average_Medicare_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df2[binned_var] = pd.qcut(df2[var], percentile)
# Create the bin labels
df2['bin_label'] = df2[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df2['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(
title='Distribution of Average Medicare Payments State Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
This feature adds state-level granularity to the analysis allows for identifying regional disparities in Medicare payments for each DRG. It helps understand variations in healthcare costs influenced by state-level policies, demographics, and healthcare infrastructure. Payments fall witin the higher range i.e. [1.74-5.17] could be flagged as anomalies.
3.1.3 Per DRG by State & City ¶
table_3 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_Medicare_Payments'].mean().reset_index()
table_3 = table_3.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Bystatescity'}).sort_values(by='Average_Medicare_Bystatescity', ascending=False)
table_3.head(10)
| DRG | Provider_State | Provider_City | Average_Medicare_Bystatescity | |
|---|---|---|---|---|
| 2126950 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | CA | STANFORD | 154620.81 |
| 9244283 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | NY | VALHALLA | 133177.26 |
| 2125896 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | CA | FREMONT | 113462.09 |
| 9285667 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | CA | STANFORD | 109303.21 |
| 9345110 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | NY | VALHALLA | 107456.18 |
| 5312030 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | NY | VALHALLA | 100040.26 |
| 6831303 | 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | PA | HERSHEY | 99114.86 |
| 2186393 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | NY | VALHALLA | 98792.46 |
| 9184840 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | CA | STANFORD | 94614.03 |
| 9361733 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | TX | GALVESTON | 89857.83 |
table_3_stats = calculate_descriptive_stats(table_3, 'Average_Medicare_Bystatescity')
table_3_stats
| Average_Medicare_Bystatescity | |
|---|---|
| mean | 8133.157995 |
| min | 1148.900000 |
| 25% | 4070.555000 |
| 50% | 5925.720000 |
| 75% | 9701.325000 |
| 90% | 14810.964000 |
| 95% | 20455.067000 |
| max | 154620.810000 |
# Add average medicare payments by states to the dataset
df3 = pd.merge(data, table_3, how='left', on=['DRG', 'Provider_State', 'Provider_City'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df3['Average_Medicare_Bystatescity_Ratio'] = np.where(df3['Average_Medicare_Bystatescity']==0,0, df3['Average_Medicare_Payments'] / df3['Average_Medicare_Bystatescity'])
Distribution of Average Medicare Payments By State and City Ratio¶
# Define variables and quantile range
var = 'Average_Medicare_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df3[binned_var] = pd.qcut(df3[var], percentile, duplicates='drop')
# Create the bin labels
df3['bin_label'] = df3[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df3['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(
title='Distribution of Average Medicare Payments By State and City Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df3[var], 95)
# Create anomaly indicator
df3['Anomaly'] = np.where(df3[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df3['Anomaly'].value_counts()
anomaly_counts
Anomaly 0 154911 1 8154 Name: count, dtype: int64
This feature further disaggregating by city within each state provides insights into localized healthcare payment patterns. It identifies cities where Medicare payments differ markedly across various DRGs, indicating potential anomalies in healthcare service delivery or cost structures at the city level.
3.1.4 Avgerage Medicare Payments per DRG by Zip Code ¶
table_4 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_Medicare_Payments'].mean().reset_index()
table_4 = table_4.rename(columns={'Average_Medicare_Payments': 'Average_Medicare_Byzip'}).sort_values(by='Average_Medicare_Byzip', ascending=False)
table_4.head(10)
| DRG | Provider_Zipcode | Average_Medicare_Byzip | |
|---|---|---|---|
| 66990 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 94305 | 154620.81 |
| 278041 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | 10595 | 133177.26 |
| 66996 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 94538 | 113462.09 |
| 283753 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 94305 | 109303.21 |
| 281094 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 10595 | 107456.18 |
| 283585 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 90095 | 101717.82 |
| 158974 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | 10595 | 100040.26 |
| 204951 | 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | 17033 | 99114.86 |
| 64331 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 10595 | 98792.46 |
| 278317 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | 20060 | 95701.42 |
table_4_stats = calculate_descriptive_stats(table_4, 'Average_Medicare_Byzip')
table_4_stats
| Average_Medicare_Byzip | |
|---|---|
| mean | 8455.257017 |
| min | 1148.900000 |
| 25% | 4172.160000 |
| 50% | 6125.210000 |
| 75% | 10014.280000 |
| 90% | 15625.386000 |
| 95% | 21653.620000 |
| max | 154620.810000 |
# Add average medicare payments by states to the dataset
df4 = pd.merge(data, table_4, how='left', on=['DRG', 'Provider_Zipcode'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df4['Average_Medicare_Byzip_Ratio'] = np.where(df4['Average_Medicare_Byzip']==0,0, df4['Average_Medicare_Payments'] / df4['Average_Medicare_Byzip'])
Distribution of Average Medicare Payments By Zip Code Ratio¶
# Define variables and quantile range
var = 'Average_Medicare_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df4[binned_var] = pd.qcut(df4[var], percentile, duplicates='drop')
# Create the bin labels
df4['bin_label'] = df4[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df4['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(
title='Distribution of Average Medicare Payments By Zip Code Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df4[var], 95)
# Create anomaly indicator
df4['Anomaly'] = np.where(df4[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df4['Anomaly'].value_counts()
anomaly_counts
Anomaly 0 155070 1 7995 Name: count, dtype: int64
The count of anomaly will mainly performed on 'by state & city' and 'by zip code' because those two levels provide more indepth insights.
This feature add zip code-level granularity offers the most detailed view of Medicare payment variations within specific cities. It identifies zip codes where Medicare payments deviate significantly within the same city and state, pinpointing localized anomalies or disparities in healthcare utilization and costs.
3.2 Average Total Payments ¶
For this section, benchmarks are calculated by grouping the DRGs, provider states, cities and zip code from the dataset, then take the mean of the Average Total payments.
Three features are created under this section which are Average Total Payments by DRG, States, Cities & Zip Code. They can provides comprehensive insights into healthcare expenditure patterns across various levels of granularity.
3.2.1 Per DRG by State ¶
table_5 = data.groupby(['DRG', 'Provider_State'])['Average_Total_Payments'].mean().reset_index()
table_5 = table_5.rename(columns={'Average_Total_Payments': 'Average_Total_Bystate'}).sort_values(by='Average_Total_Bystate', ascending=False)
table_5.head(10)
| DRG | Provider_State | Average_Total_Bystate | |
|---|---|---|---|
| 2652 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | AK | 68006.425000 |
| 4648 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | DC | 66437.692500 |
| 4692 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | AK | 61300.630000 |
| 4641 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | AK | 57238.680000 |
| 4652 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | HI | 56482.922000 |
| 4726 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | NY | 56465.890641 |
| 1071 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | AK | 54385.475000 |
| 4696 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | CA | 53733.333710 |
| 4738 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | VT | 52761.330000 |
| 4699 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | DC | 52491.154000 |
table_5_stats = calculate_descriptive_stats(table_5, 'Average_Total_Bystate')
table_5_stats
| Average_Total_Bystate | |
|---|---|
| mean | 10640.517669 |
| min | 3236.026333 |
| 25% | 5583.888889 |
| 50% | 7780.726250 |
| 75% | 12493.163000 |
| 90% | 19845.450487 |
| 95% | 27083.356325 |
| max | 68006.425000 |
# Add average medicare payments by states to the dataset
df5 = pd.merge(data, table_5, how='left', on=['DRG', 'Provider_State'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df5['Average_Total_Bystates_Ratio'] = np.where(df5['Average_Total_Bystate']==0,0, df5['Average_Total_Payments'] / df5['Average_Total_Bystate'])
Distribution of Average Total Payments By State Ratio¶
# Define variables and quantile range
var = 'Average_Total_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df5[binned_var] = pd.qcut(df5[var], percentile)
# Create the bin labels
df5['bin_label'] = df5[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df5['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)
# Update the layout
fig.update_layout(
title='Distribution of Average Total Payments By State Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
This feature adds state-level granularity to the analysis identifies regional variations in total payments for each DRG.
3.2.2 Per DRG by State & City ¶
table_6 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_Total_Payments'].mean().reset_index()
table_6 = table_6.rename(columns={'Average_Total_Payments': 'Average_Total_Bystatescity'}).sort_values(by='Average_Total_Bystatescity', ascending=False)
table_6.head(10)
| DRG | Provider_State | Provider_City | Average_Total_Bystatescity | |
|---|---|---|---|---|
| 2126950 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | CA | STANFORD | 156158.18 |
| 9244283 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | NY | VALHALLA | 140255.26 |
| 2125896 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | CA | FREMONT | 119113.00 |
| 9345110 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | NY | VALHALLA | 119028.90 |
| 9285667 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | CA | STANFORD | 109945.57 |
| 5312030 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | NY | VALHALLA | 101796.26 |
| 6831303 | 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | PA | HERSHEY | 100018.33 |
| 2186393 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | NY | VALHALLA | 99929.46 |
| 9184840 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | CA | STANFORD | 95302.74 |
| 9361733 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | TX | GALVESTON | 90701.58 |
table_6_stats = calculate_descriptive_stats(table_6, 'Average_Total_Bystatescity')
table_6_stats
| Average_Total_Bystatescity | |
|---|---|
| mean | 9316.580074 |
| min | 2673.000000 |
| 25% | 5105.865000 |
| 50% | 6969.000000 |
| 75% | 10860.125000 |
| 90% | 16328.638000 |
| 95% | 22435.810500 |
| max | 156158.180000 |
# Add average medicare payments by states to the dataset
df6 = pd.merge(data, table_6, how='left', on=['DRG', 'Provider_State', 'Provider_City'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df6['Average_Total_Bystatescity_Ratio'] = np.where(df6['Average_Total_Bystatescity']==0,0, df6['Average_Total_Payments'] / df6['Average_Total_Bystatescity'])
Distribution of Average Total Payments By State and City Ratio¶
# Define variables and quantile range
var = 'Average_Total_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df6[binned_var] = pd.qcut(df6[var], percentile, duplicates='drop')
# Create the bin labels
df6['bin_label'] = df6[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df6['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)
# Update the layout
fig.update_layout(
title='Distribution of Average Total Payments By State and City Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df6[var], 95)
# Create anomaly indicator
df6['Anomaly'] = np.where(df6[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df6['Anomaly'].value_counts()
anomaly_counts
Anomaly 0 154911 1 8154 Name: count, dtype: int64
Simliar to Average medicare payments, this feature add additional level of granularity in detecting anomalies.
3.2.3 Per DRG by Zip Code ¶
table_7 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_Total_Payments'].mean().reset_index()
table_7 = table_7.rename(columns={'Average_Total_Payments': 'Average_Total_Byzip'}).sort_values(by='Average_Total_Byzip', ascending=False)
table_7.head(10)
| DRG | Provider_Zipcode | Average_Total_Byzip | |
|---|---|---|---|
| 66990 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 94305 | 156158.18 |
| 278041 | 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... | 10595 | 140255.26 |
| 66996 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 94538 | 119113.00 |
| 281094 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 10595 | 119028.90 |
| 283753 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 94305 | 109945.57 |
| 283585 | 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS | 90095 | 103802.96 |
| 158974 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | 10595 | 101796.26 |
| 204951 | 469 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... | 17033 | 100018.33 |
| 64331 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 10595 | 99929.46 |
| 159489 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | 29220 | 99307.30 |
table_7_stats = calculate_descriptive_stats(table_7, 'Average_Total_Byzip')
table_7_stats
| Average_Total_Byzip | |
|---|---|
| mean | 9665.549998 |
| min | 2682.640000 |
| 25% | 5215.000000 |
| 50% | 7178.760000 |
| 75% | 11233.000000 |
| 90% | 17190.889000 |
| 95% | 23786.226000 |
| max | 156158.180000 |
# Add average medicare payments by states to the dataset
df7 = pd.merge(data, table_7, how='left', on=['DRG', 'Provider_Zipcode'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df7['Average_Total_Byzip_Ratio'] = np.where(df7['Average_Total_Byzip']==0,0, df7['Average_Total_Payments'] / df7['Average_Total_Byzip'])
Distribution of Average Total Payments By Zip Code Ratio¶
# Define variables and quantile range
var = 'Average_Total_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df7[binned_var] = pd.qcut(df7[var], percentile, duplicates='drop')
# Create the bin labels
df7['bin_label'] = df7[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df7['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.dense)
# Update the layout
fig.update_layout(
title='Distribution of Average Total Payments By Zip Code Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df7[var], 95)
# Create anomaly indicator
df7['Anomaly'] = np.where(df7[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df7['Anomaly'].value_counts()
anomaly_counts
Anomaly 0 155056 1 8009 Name: count, dtype: int64
Simliar to avg medicare payments, this feature add additional level of granularity.
This section of Average Total Payments focuses all costs associated with healthcare services, offering a broader perspective on anomalies related to overall service costs, resource, and patient payments.
3.3 Average Out of Pocket Payments ¶
For this section, benchmarks are calculated by using Total payments minus Medicare payments to get the Average Out-Of-Pocket Payments of Patients. Then group the DRGs, provider states, cities and zip code from the dataset, then take the mean of the Average OOP payments.
Three features are created under this section which are Average OOP Payments per DRG, by States, Cities & Zip Code. They can provides insights into healthcare expenditure patterns across various levels of granularity from the patient's perspective. Hospital that charges a higher than average OOP cost could considered anomalies.
3.3.1 Per DRG by States ¶
data['Average_OOP'] = data['Average_Total_Payments'] - data['Average_Medicare_Payments']
data.head()
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | Average_Total_Payments | Average_Medicare_Payments | Average_OOP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | 32963.07 | 5777.24 | 4763.73 | 1013.51 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | 15131.85 | 5787.57 | 4976.71 | 810.86 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | 37560.37 | 5434.95 | 4453.79 | 981.16 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | 13998.28 | 5417.56 | 4129.16 | 1288.40 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | 31633.27 | 5658.33 | 4851.44 | 806.89 |
table_8 = data.groupby(['DRG', 'Provider_State'])['Average_OOP'].mean().reset_index()
table_8 = table_8.rename(columns={'Average_OOP': 'Average_OOP_Bystates'})
table_8.head(10)
| DRG | Provider_State | Average_OOP_Bystates | |
|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AK | 1988.170000 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AL | 1144.018696 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AR | 1180.116250 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | AZ | 1358.722917 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CA | 1508.977761 |
| 5 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CO | 1631.992000 |
| 6 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | CT | 1301.466000 |
| 7 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | DC | 1383.156667 |
| 8 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | DE | 1408.896667 |
| 9 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | FL | 1334.597778 |
table_8_stats = calculate_descriptive_stats(table_8, 'Average_OOP_Bystates')
table_8_stats
| Average_OOP_Bystates | |
|---|---|
| mean | 1350.846689 |
| min | 359.710000 |
| 25% | 957.941455 |
| 50% | 1105.953333 |
| 75% | 1424.579545 |
| 90% | 2109.421204 |
| 95% | 2687.367642 |
| max | 19964.515000 |
# Add average medicare payments by states to the dataset
df8 = pd.merge(data, table_8, how='left', on=['DRG', 'Provider_State'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df8['Average_OOP_Bystates_Ratio'] = np.where(df8['Average_OOP_Bystates']==0,0, df8['Average_OOP'] / df8['Average_OOP_Bystates'])
Distribution of Average Out-Of-Pocket Payments By State Ratio¶
# Define variables and quantile range
var = 'Average_OOP_Bystates_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df8[binned_var] = pd.qcut(df8[var], percentile)
# Create the bin labels
df8['bin_label'] = df8[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df8['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor_r)
# Update the layout
fig.update_layout(
title='Distribution of Average Out-Of-Pocket Payments By State Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
This feature provides insights into the disparities in patient expenses across different states. Payments fall into the high range i.e. [1.91-3.20], [3.20-38.19] could be could be consider anomalies.
3.3.2 Per DRG by State & City ¶
table_9 = data.groupby(['DRG', 'Provider_State', 'Provider_City'])['Average_OOP'].mean().reset_index()
table_9 = table_9.rename(columns={'Average_OOP': 'Average_OOP_Bystatescity'}).sort_values(by='Average_OOP_Bystatescity', ascending=False)
table_9.head(10)
| DRG | Provider_State | Provider_City | Average_OOP_Bystatescity | |
|---|---|---|---|---|
| 2916206 | 249 - PERC CARDIOVASC PROC W NON-DRUG-ELUTING ... | WA | BELLEVUE | 75998.660 |
| 6748071 | 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC | WA | FEDERAL WAY | 51457.650 |
| 3054632 | 252 - OTHER VASCULAR PROCEDURES W MCC | IN | BLOOMINGTON | 41686.250 |
| 2054178 | 203 - BRONCHITIS & ASTHMA W/O CC/MCC | MA | ATTLEBORO | 39691.910 |
| 2805011 | 247 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... | SD | SIOUX FALLS | 37708.605 |
| 2682933 | 246 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... | NJ | BERLIN | 37034.820 |
| 6723696 | 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC | NY | WEST ISLIP | 35437.920 |
| 1286505 | 189 - PULMONARY EDEMA & RESPIRATORY FAILURE | PA | READING | 33117.260 |
| 5329100 | 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC | TX | MCKINNEY | 32245.250 |
| 2551805 | 244 - PERMANENT CARDIAC PACEMAKER IMPLANT W/O ... | IN | RICHMOND | 32146.450 |
table_9_stats = calculate_descriptive_stats(table_9, 'Average_OOP_Bystatescity')
table_9_stats
| Average_OOP_Bystatescity | |
|---|---|
| mean | 1183.422079 |
| min | 0.000000 |
| 25% | 783.055000 |
| 50% | 938.000000 |
| 75% | 1204.592778 |
| 90% | 1846.114000 |
| 95% | 2579.456000 |
| max | 75998.660000 |
# Add average medicare payments by states to the dataset
df9 = pd.merge(data, table_9, how='left', on=['DRG', 'Provider_State', 'Provider_City'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df9['Average_OOP_Bystatescity_Ratio'] = np.where(df9['Average_OOP_Bystatescity']==0,0, df9['Average_OOP'] / df9['Average_OOP_Bystatescity'])
Distribution of Average Out-Of-Pocket Payments By State & City Ratio¶
# Define variables and quantile range
var = 'Average_OOP_Bystatescity_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df9[binned_var] = pd.qcut(df9[var], percentile, duplicates='drop')
# Create the bin labels
df9['bin_label'] = df9[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df9['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor)
# Update the layout
fig.update_layout(
title='Distribution of Average Out-Of-Pocket Payments By State & City Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df9[var], 95)
# Create anomaly indicator
df9['Anomaly'] = np.where(df9[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df9['Anomaly'].value_counts()
anomaly_counts
Anomaly 0 154911 1 8154 Name: count, dtype: int64
Similar as the previous features, this adds city level granularity. Payments fall in the higher bin could be flagged as anomalies.
3.3.3 Per DRG by Zip Code ¶
table_10 = data.groupby(['DRG', 'Provider_Zipcode'])['Average_OOP'].mean().reset_index()
table_10 = table_10.rename(columns={'Average_OOP': 'Average_OOP_Byzip'}).sort_values(by='Average_OOP_Byzip', ascending=False)
table_10.head(10)
| DRG | Provider_Zipcode | Average_OOP_Byzip | |
|---|---|---|---|
| 88487 | 249 - PERC CARDIOVASC PROC W NON-DRUG-ELUTING ... | 98004 | 75998.66 |
| 84205 | 247 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... | 57108 | 74168.03 |
| 37116 | 189 - PULMONARY EDEMA & RESPIRATORY FAILURE | 19605 | 65079.84 |
| 204500 | 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC | 98003 | 51457.65 |
| 65978 | 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... | 60657 | 41849.50 |
| 93126 | 252 - OTHER VASCULAR PROCEDURES W MCC | 47403 | 41686.25 |
| 61115 | 203 - BRONCHITIS & ASTHMA W/O CC/MCC | 2703 | 39691.91 |
| 79549 | 246 - PERC CARDIOVASC PROC W DRUG-ELUTING STEN... | 8009 | 37034.82 |
| 201759 | 460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC | 11795 | 35437.92 |
| 90311 | 251 - PERC CARDIOVASC PROC W/O CORONARY ARTERY... | 57108 | 34721.65 |
table_10_stats = calculate_descriptive_stats(table_10, 'Average_OOP_Byzip')
table_10_stats
| Average_OOP_Byzip | |
|---|---|
| mean | 1210.292981 |
| min | 0.000000 |
| 25% | 778.770000 |
| 50% | 938.000000 |
| 75% | 1222.000000 |
| 90% | 1923.914000 |
| 95% | 2715.580000 |
| max | 75998.660000 |
# Add average medicare payments by states to the dataset
df10 = pd.merge(data, table_10, how='left', on=['DRG', 'Provider_Zipcode'])
# Calculate the ratio of the Average medicare payments to the Average medicare payments by states
df10['Average_OOP_Byzip_Ratio'] = np.where(df10['Average_OOP_Byzip']==0,0, df10['Average_OOP'] / df10['Average_OOP_Byzip'])
Distribution of Average Out-Of-Pocket Payments By Zip Code Ratio¶
# Define variables and quantile range
var = 'Average_OOP_Byzip_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df10[binned_var] = pd.qcut(df10[var], percentile, duplicates='drop')
# Create the bin labels
df10['bin_label'] = df10[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df10['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Purpor)
# Update the layout
fig.update_layout(
title='Distribution of Average Out-Of-Pocket Payments By Zip Code Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
# Calculate 95th percentile threshold
threshold = np.percentile(df10[var], 95)
# Create anomaly indicator
df10['Anomaly'] = np.where(df10[var] > threshold, 1, 0)
# Count anomalies
anomaly_counts = df10['Anomaly'].value_counts()
anomaly_counts
Anomaly 0 155070 1 7995 Name: count, dtype: int64
Similar as the previous features, this adds more level of granularity. Payments fall in the higher bin could be flagged as anomalies.
Data Preparation for Modeling¶
# Append new features back to orginal dataset for future anlaysis
# Make a copy of the original DataFrame
data_copy = data.copy()
# List of DataFrames
dfs = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10]
# Function to filter and merge only the 'Ratio' columns
def merge_ratio_columns(left, right):
ratio_columns = [col for col in right.columns if col.endswith('Ratio')]
return pd.merge(left, right[['DRG', 'Provider_Id', 'Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns],
on=['DRG', 'Provider_Id', 'Provider_State', 'Provider_City', 'Provider_Zipcode'], how='left')
# Use reduce to iteratively merge all DataFrames with the ratio columns into df_copy
final_df = reduce(merge_ratio_columns, dfs, data_copy)
# Display the final merged DataFrame
final_df.head()
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | 32963.07 | ... | 0.857428 | 1.035685 | 1.038763 | 1.0 | 1.005855 | 1.037810 | 1.0 | 0.885921 | 1.033356 | 1.0 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | 15131.85 | ... | 0.895762 | 1.081989 | 1.000000 | 1.0 | 1.007653 | 1.000000 | 1.0 | 0.708782 | 1.000000 | 1.0 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | 37560.37 | ... | 0.801642 | 0.968301 | 1.000000 | 1.0 | 0.946260 | 1.000000 | 1.0 | 0.857643 | 1.000000 | 1.0 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | 13998.28 | ... | 0.743211 | 0.897723 | 0.899200 | 1.0 | 0.943232 | 0.915208 | 1.0 | 1.126205 | 0.970586 | 1.0 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | 31633.27 | ... | 0.873215 | 1.054754 | 1.000000 | 1.0 | 0.985152 | 1.000000 | 1.0 | 0.705312 | 1.000000 | 1.0 |
5 rows × 23 columns
# Extracting only the ratio columns
ratio_columns = [col for col in final_df.columns if col.endswith('Ratio')]
X = final_df[ratio_columns]
# Splitting data into train and test sets for X (features only)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=123)
X_train
| Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|
| 22044 | 0.950037 | 0.874802 | 1.000000 | 1.000000 | 0.897649 | 1.000000 | 1.000000 | 1.088349 | 1.000000 | 1.000000 |
| 85476 | 1.859740 | 1.463951 | 1.095857 | 1.000000 | 1.398264 | 1.074484 | 1.000000 | 0.483648 | 0.589728 | 1.000000 |
| 126489 | 0.937190 | 0.733368 | 0.927008 | 0.927008 | 0.838599 | 0.973378 | 0.973378 | 1.426914 | 1.136762 | 1.136762 |
| 93821 | 1.147512 | 1.333409 | 1.137058 | 1.000000 | 1.227748 | 1.110102 | 1.000000 | 0.659707 | 0.882715 | 1.000000 |
| 133143 | 0.754529 | 0.844225 | 1.000000 | 1.000000 | 0.876606 | 1.000000 | 1.000000 | 1.003489 | 1.000000 | 1.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146449 | 1.490217 | 1.595831 | 1.261310 | 1.000000 | 1.528690 | 1.246431 | 1.000000 | 1.118468 | 1.130201 | 1.000000 |
| 119906 | 1.336942 | 1.454418 | 1.263847 | 1.289831 | 1.246224 | 1.195356 | 1.219424 | 0.635439 | 0.876429 | 0.892346 |
| 17730 | 1.432104 | 1.433615 | 1.073340 | 1.000000 | 1.365176 | 1.062598 | 1.000000 | 0.606056 | 0.841591 | 1.000000 |
| 28030 | 1.658940 | 1.287742 | 1.000000 | 1.000000 | 1.266112 | 1.000000 | 1.000000 | 1.156397 | 1.000000 | 1.000000 |
| 15725 | 1.311969 | 1.048922 | 0.926972 | 1.000000 | 1.041244 | 0.925228 | 1.000000 | 0.902966 | 0.890192 | 1.000000 |
130452 rows × 10 columns
# Standardize data
X_train = StandardScaler().fit_transform(X_train)
X_train = pd.DataFrame(X_train)
X_test = StandardScaler().fit_transform(X_test)
X_test = pd.DataFrame(X_test)
X_train.shape
(130452, 10)
Autoencoder is a type of neural network used for unsupervised learning of efficient codings. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction or anomaly detection. It detect anomalies by measuring the reconstruction error. Data points with high reconstruction error are considered anomalies. It consists of two parts:
- Encoder: Maps the input data to a lower-dimensional representation.
- Decoder: Maps the encoded data back to the original data space.
Use Cases:
- Anomaly Detection
- Dimensinality Reduction
- Denoising
4.1 Model 1 ¶
clf1 = AutoEncoder(contamination=0.05, hidden_neurons =[25, 2, 2, 25])
clf1.fit(X_train)
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ dense (Dense) │ (None, 10) │ 110 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout (Dropout) │ (None, 10) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_1 (Dense) │ (None, 10) │ 110 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_1 (Dropout) │ (None, 10) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_2 (Dense) │ (None, 25) │ 275 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_2 (Dropout) │ (None, 25) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_3 (Dense) │ (None, 2) │ 52 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_3 (Dropout) │ (None, 2) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_4 (Dense) │ (None, 2) │ 6 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_4 (Dropout) │ (None, 2) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_5 (Dense) │ (None, 25) │ 75 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_5 (Dropout) │ (None, 25) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_6 (Dense) │ (None, 10) │ 260 │ └─────────────────────────────────┴───────────────────────────┴────────────┘
Total params: 888 (3.47 KB)
Trainable params: 888 (3.47 KB)
Non-trainable params: 0 (0.00 B)
None Epoch 1/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 331us/step - loss: 9.4096 - val_loss: 1.3466 Epoch 2/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 1.2268 - val_loss: 1.0661 Epoch 3/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 1.0227 - val_loss: 1.0231 Epoch 4/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 0.9917 - val_loss: 1.0182 Epoch 5/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 0.9987 - val_loss: 1.0176 Epoch 6/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 317us/step - loss: 0.9991 - val_loss: 1.0172 Epoch 7/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 641us/step - loss: 1.0018 - val_loss: 1.0172 Epoch 8/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 326us/step - loss: 0.9929 - val_loss: 1.0172 Epoch 9/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 0.9940 - val_loss: 1.0172 Epoch 10/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 335us/step - loss: 1.0036 - val_loss: 1.0172 Epoch 11/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 359us/step - loss: 1.0009 - val_loss: 1.0172 Epoch 12/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 346us/step - loss: 0.9975 - val_loss: 1.0172 Epoch 13/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 350us/step - loss: 1.0038 - val_loss: 1.0172 Epoch 14/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 337us/step - loss: 0.9980 - val_loss: 1.0172 Epoch 15/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 597us/step - loss: 1.0119 - val_loss: 1.0172 Epoch 16/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 331us/step - loss: 1.0007 - val_loss: 1.0172 Epoch 17/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 330us/step - loss: 1.0023 - val_loss: 1.0172 Epoch 18/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 330us/step - loss: 1.0091 - val_loss: 1.0172 Epoch 19/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 327us/step - loss: 0.9884 - val_loss: 1.0172 Epoch 20/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 1.0019 - val_loss: 1.0172 Epoch 21/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 0.9907 - val_loss: 1.0172 Epoch 22/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 0.9980 - val_loss: 1.0172 Epoch 23/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 0.9965 - val_loss: 1.0172 Epoch 24/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 0.9945 - val_loss: 1.0172 Epoch 25/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 0.9993 - val_loss: 1.0172 Epoch 26/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0095 - val_loss: 1.0172 Epoch 27/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 1.0021 - val_loss: 1.0172 Epoch 28/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 319us/step - loss: 1.0051 - val_loss: 1.0172 Epoch 29/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 1.0006 - val_loss: 1.0172 Epoch 30/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0036 - val_loss: 1.0172 Epoch 31/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 0.9978 - val_loss: 1.0172 Epoch 32/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 0.9995 - val_loss: 1.0172 Epoch 33/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 327us/step - loss: 0.9957 - val_loss: 1.0172 Epoch 34/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 324us/step - loss: 1.0080 - val_loss: 1.0172 Epoch 35/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 336us/step - loss: 1.0028 - val_loss: 1.0172 Epoch 36/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0011 - val_loss: 1.0172 Epoch 37/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0028 - val_loss: 1.0172 Epoch 38/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 0.9950 - val_loss: 1.0172 Epoch 39/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 1.0163 - val_loss: 1.0172 Epoch 40/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 406us/step - loss: 0.9981 - val_loss: 1.0172 Epoch 41/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 350us/step - loss: 0.9921 - val_loss: 1.0172 Epoch 42/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 319us/step - loss: 0.9984 - val_loss: 1.0172 Epoch 43/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 325us/step - loss: 0.9915 - val_loss: 1.0172 Epoch 44/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 327us/step - loss: 0.9846 - val_loss: 1.0172 Epoch 45/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 1.0127 - val_loss: 1.0172 Epoch 46/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 319us/step - loss: 0.9994 - val_loss: 1.0172 Epoch 47/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 0.9902 - val_loss: 1.0172 Epoch 48/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 1.0028 - val_loss: 1.0172 Epoch 49/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 317us/step - loss: 0.9923 - val_loss: 1.0172 Epoch 50/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 319us/step - loss: 0.9998 - val_loss: 1.0172 Epoch 51/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 1.0002 - val_loss: 1.0172 Epoch 52/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 328us/step - loss: 1.0109 - val_loss: 1.0172 Epoch 53/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 320us/step - loss: 1.0099 - val_loss: 1.0172 Epoch 54/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0034 - val_loss: 1.0172 Epoch 55/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 325us/step - loss: 1.0007 - val_loss: 1.0172 Epoch 56/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 1.0091 - val_loss: 1.0172 Epoch 57/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0037 - val_loss: 1.0172 Epoch 58/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 0.9957 - val_loss: 1.0172 Epoch 59/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 0.9833 - val_loss: 1.0172 Epoch 60/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 324us/step - loss: 0.9982 - val_loss: 1.0172 Epoch 61/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 0.9951 - val_loss: 1.0172 Epoch 62/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 424us/step - loss: 1.0189 - val_loss: 1.0172 Epoch 63/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 325us/step - loss: 1.0119 - val_loss: 1.0172 Epoch 64/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 326us/step - loss: 0.9928 - val_loss: 1.0172 Epoch 65/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 325us/step - loss: 0.9972 - val_loss: 1.0172 Epoch 66/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 324us/step - loss: 1.0097 - val_loss: 1.0172 Epoch 67/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 1.0071 - val_loss: 1.0172 Epoch 68/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 321us/step - loss: 1.0083 - val_loss: 1.0172 Epoch 69/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 0.9977 - val_loss: 1.0172 Epoch 70/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 329us/step - loss: 1.0079 - val_loss: 1.0172 Epoch 71/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 331us/step - loss: 1.0026 - val_loss: 1.0172 Epoch 72/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 332us/step - loss: 0.9974 - val_loss: 1.0172 Epoch 73/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 339us/step - loss: 0.9897 - val_loss: 1.0172 Epoch 74/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 372us/step - loss: 0.9830 - val_loss: 1.0172 Epoch 75/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 330us/step - loss: 0.9991 - val_loss: 1.0172 Epoch 76/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 375us/step - loss: 0.9904 - val_loss: 1.0172 Epoch 77/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 0.9804 - val_loss: 1.0172 Epoch 78/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 319us/step - loss: 0.9965 - val_loss: 1.0172 Epoch 79/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 318us/step - loss: 0.9942 - val_loss: 1.0172 Epoch 80/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 334us/step - loss: 1.0076 - val_loss: 1.0172 Epoch 81/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 537us/step - loss: 0.9975 - val_loss: 1.0172 Epoch 82/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0003 - val_loss: 1.0172 Epoch 83/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 389us/step - loss: 1.0101 - val_loss: 1.0172 Epoch 84/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 339us/step - loss: 0.9761 - val_loss: 1.0172 Epoch 85/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 399us/step - loss: 1.0048 - val_loss: 1.0172 Epoch 86/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 0.9818 - val_loss: 1.0172 Epoch 87/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 356us/step - loss: 1.0224 - val_loss: 1.0172 Epoch 88/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 393us/step - loss: 1.0202 - val_loss: 1.0172 Epoch 89/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 325us/step - loss: 0.9974 - val_loss: 1.0172 Epoch 90/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 1.0037 - val_loss: 1.0172 Epoch 91/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 338us/step - loss: 0.9960 - val_loss: 1.0172 Epoch 92/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 886us/step - loss: 0.9916 - val_loss: 1.0172 Epoch 93/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 382us/step - loss: 1.0014 - val_loss: 1.0172 Epoch 94/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 447us/step - loss: 0.9919 - val_loss: 1.0172 Epoch 95/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 331us/step - loss: 1.0041 - val_loss: 1.0172 Epoch 96/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 328us/step - loss: 0.9991 - val_loss: 1.0172 Epoch 97/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 1.0028 - val_loss: 1.0172 Epoch 98/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 322us/step - loss: 0.9938 - val_loss: 1.0172 Epoch 99/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 0.9955 - val_loss: 1.0172 Epoch 100/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 323us/step - loss: 0.9972 - val_loss: 1.0172 4077/4077 ━━━━━━━━━━━━━━━━━━━━ 1s 159us/step
AutoEncoder(batch_size=32, contamination=0.05, dropout_rate=0.2, epochs=100,
hidden_activation='relu', hidden_neurons=[25, 2, 2, 25],
l2_regularizer=0.1,
loss=<function mean_squared_error at 0x16e35dbc0>, optimizer='adam',
output_activation='sigmoid', preprocessing=True, random_state=None,
validation_size=0.1, verbose=1)
# Get y train scores
y_train_scores = clf1.decision_scores_
threshold = clf1.threshold_
# get the prediction on the test data
y_test_pred = clf1.predict(X_test) # outlier labels (0 or 1)
# Predict anomaly score of X using fitted detector
y_test_scores = clf1.decision_function(X_test) # outlier scores
y_test_pred = pd.Series(y_test_pred)
y_test_scores = pd.Series(y_test_scores)
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 173us/step 1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 160us/step
print(y_test_scores.describe(), '\n')
print("Threshold for Contamination Rate:", clf1.threshold_, '\n')
clf1.get_params()
count 32613.000000 mean 2.257349 std 2.214617 min 0.035631 25% 0.990732 50% 1.493886 75% 2.679246 max 55.760914 dtype: float64 Threshold for Contamination Rate: 6.585086208915477
{'batch_size': 32,
'contamination': 0.05,
'dropout_rate': 0.2,
'epochs': 100,
'hidden_activation': 'relu',
'hidden_neurons': [25, 2, 2, 25],
'l2_regularizer': 0.1,
'loss': <function keras.src.losses.losses.mean_squared_error(y_true, y_pred)>,
'optimizer': 'adam',
'output_activation': 'sigmoid',
'preprocessing': True,
'random_state': None,
'validation_size': 0.1,
'verbose': 1}
y_test_pred.value_counts()
0 30997 1 1616 Name: count, dtype: int64
# Function to plot histograms with threshold annotation and color differentiation
def plot_histogram_with_threshold(scores, title, threshold):
fig = go.Figure()
# Split the scores into two parts: below the threshold and above/equal to the threshold
below_threshold = scores[scores < threshold]
above_threshold = scores[scores >= threshold]
# Add histogram traces for the two parts
fig.add_trace(go.Histogram(x=below_threshold, name='Normal', marker=dict(color='navy')))
fig.add_trace(go.Histogram(x=above_threshold, name='Outliers', marker=dict(color='magenta')))
# Add vertical line annotating the threshold
fig.add_vline(x=threshold, line_dash="dash", line_color="#0d5087", annotation_text="Threshold",
annotation_position="top right", annotation=dict(font=dict(color="#27496d", size=12)))
# Update layout
fig.update_layout(
title=title,
xaxis_title="Anomaly Score",
yaxis_title="Frequency",
showlegend=True,
height=450,
width=800,
bargroupgap=0.1,
bargap=0.1,
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)'
)
fig.update_xaxes(range=[-0.1, 15])
# Show the plot
fig.show()
# Plotting the combination by average for test data
plot_histogram_with_threshold(np.array(y_test_scores), "Anomaly Scores for Clf1 Model", threshold)
Model 1 Stats¶
def descriptive_stat_threshold(df,pred_score, threshold):
# Let's see how many '0's and '1's.
df = pd.DataFrame(df)
df['Anomaly_Score'] = pred_score
df['Group'] = np.where(df['Anomaly_Score']< threshold, 'Normal', 'Outlier')
# Now let's show the summary statistics:
cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(columns={'Anomaly_Score':'Count'})
cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100 # The count and count %
stat = df.groupby('Group').mean().round(4).reset_index() # The avg.
stat = cnt.merge(stat, left_on='Group',right_on='Group') # Put the count and the avg. together
return (stat)
model_1 = descriptive_stat_threshold(X_test,y_test_scores, threshold)
model_1
| Group | Count | Count % | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 30997 | 95.044921 | -0.0703 | -0.076 | -0.0625 | -0.0079 | -0.0930 | -0.0741 | -0.0113 | -0.0572 | -0.0390 | -0.0040 | 1.8847 |
| 1 | Outlier | 1616 | 4.955079 | 1.3489 | 1.458 | 1.1988 | 0.1511 | 1.7835 | 1.4205 | 0.2177 | 1.0980 | 0.7479 | 0.0776 | 9.4060 |
4.2 Model 2 ¶
clf2 = AutoEncoder(contamination=0.05, hidden_neurons =[25, 10, 2, 10, 25])
clf2.fit(X_train)
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ dense_7 (Dense) │ (None, 10) │ 110 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_6 (Dropout) │ (None, 10) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_8 (Dense) │ (None, 10) │ 110 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_7 (Dropout) │ (None, 10) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_9 (Dense) │ (None, 25) │ 275 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_8 (Dropout) │ (None, 25) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_10 (Dense) │ (None, 10) │ 260 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_9 (Dropout) │ (None, 10) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_11 (Dense) │ (None, 2) │ 22 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_10 (Dropout) │ (None, 2) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_12 (Dense) │ (None, 10) │ 30 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_11 (Dropout) │ (None, 10) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_13 (Dense) │ (None, 25) │ 275 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_12 (Dropout) │ (None, 25) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_14 (Dense) │ (None, 10) │ 260 │ └─────────────────────────────────┴───────────────────────────┴────────────┘
Total params: 1,342 (5.24 KB)
Trainable params: 1,342 (5.24 KB)
Non-trainable params: 0 (0.00 B)
None Epoch 1/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 402us/step - loss: 7.7795 - val_loss: 1.3388 Epoch 2/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 366us/step - loss: 1.2038 - val_loss: 1.0723 Epoch 3/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 368us/step - loss: 1.0422 - val_loss: 1.0297 Epoch 4/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 372us/step - loss: 1.0156 - val_loss: 1.0248 Epoch 5/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 0.9989 - val_loss: 1.0239 Epoch 6/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 351us/step - loss: 0.9979 - val_loss: 1.0237 Epoch 7/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 365us/step - loss: 0.9875 - val_loss: 1.0237 Epoch 8/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 388us/step - loss: 0.9978 - val_loss: 1.0237 Epoch 9/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 363us/step - loss: 0.9980 - val_loss: 1.0239 Epoch 10/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 370us/step - loss: 0.9968 - val_loss: 1.0236 Epoch 11/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0054 - val_loss: 1.0236 Epoch 12/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 362us/step - loss: 0.9961 - val_loss: 1.0236 Epoch 13/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 1.0104 - val_loss: 1.0236 Epoch 14/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 0.9977 - val_loss: 1.0236 Epoch 15/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 358us/step - loss: 0.9966 - val_loss: 1.0236 Epoch 16/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 377us/step - loss: 0.9947 - val_loss: 1.0236 Epoch 17/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 1.0099 - val_loss: 1.0236 Epoch 18/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 361us/step - loss: 0.9948 - val_loss: 1.0236 Epoch 19/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 0.9891 - val_loss: 1.0236 Epoch 20/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 351us/step - loss: 1.0029 - val_loss: 1.0236 Epoch 21/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 347us/step - loss: 1.0241 - val_loss: 1.0236 Epoch 22/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 348us/step - loss: 1.0163 - val_loss: 1.0236 Epoch 23/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 0.9938 - val_loss: 1.0236 Epoch 24/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0016 - val_loss: 1.0236 Epoch 25/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 0.9929 - val_loss: 1.0236 Epoch 26/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 362us/step - loss: 0.9988 - val_loss: 1.0236 Epoch 27/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 0.9953 - val_loss: 1.0236 Epoch 28/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 444us/step - loss: 1.0125 - val_loss: 1.0236 Epoch 29/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 373us/step - loss: 1.0145 - val_loss: 1.0236 Epoch 30/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 368us/step - loss: 0.9976 - val_loss: 1.0236 Epoch 31/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0040 - val_loss: 1.0236 Epoch 32/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 351us/step - loss: 0.9900 - val_loss: 1.0236 Epoch 33/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 345us/step - loss: 0.9995 - val_loss: 1.0236 Epoch 34/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0074 - val_loss: 1.0236 Epoch 35/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 363us/step - loss: 1.0093 - val_loss: 1.0236 Epoch 36/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 362us/step - loss: 1.0168 - val_loss: 1.0236 Epoch 37/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 0.9934 - val_loss: 1.0236 Epoch 38/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 394us/step - loss: 0.9876 - val_loss: 1.0236 Epoch 39/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 374us/step - loss: 0.9973 - val_loss: 1.0236 Epoch 40/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 373us/step - loss: 1.0041 - val_loss: 1.0236 Epoch 41/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 368us/step - loss: 0.9988 - val_loss: 1.0236 Epoch 42/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 356us/step - loss: 0.9813 - val_loss: 1.0236 Epoch 43/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0024 - val_loss: 1.0236 Epoch 44/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 0.9981 - val_loss: 1.0236 Epoch 45/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 0.9961 - val_loss: 1.0236 Epoch 46/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 351us/step - loss: 0.9931 - val_loss: 1.0236 Epoch 47/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 356us/step - loss: 0.9947 - val_loss: 1.0236 Epoch 48/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 356us/step - loss: 1.0011 - val_loss: 1.0236 Epoch 49/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0044 - val_loss: 1.0236 Epoch 50/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 0.9825 - val_loss: 1.0236 Epoch 51/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 1.0033 - val_loss: 1.0236 Epoch 52/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 1.0066 - val_loss: 1.0236 Epoch 53/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 1.0043 - val_loss: 1.0236 Epoch 54/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 1.0005 - val_loss: 1.0236 Epoch 55/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0041 - val_loss: 1.0236 Epoch 56/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0079 - val_loss: 1.0236 Epoch 57/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0213 - val_loss: 1.0236 Epoch 58/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 358us/step - loss: 1.0049 - val_loss: 1.0236 Epoch 59/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 1.0110 - val_loss: 1.0236 Epoch 60/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 361us/step - loss: 0.9978 - val_loss: 1.0236 Epoch 61/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 1.0106 - val_loss: 1.0236 Epoch 62/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 365us/step - loss: 1.0068 - val_loss: 1.0236 Epoch 63/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 482us/step - loss: 0.9943 - val_loss: 1.0236 Epoch 64/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 1.0123 - val_loss: 1.0236 Epoch 65/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0084 - val_loss: 1.0236 Epoch 66/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 351us/step - loss: 1.0025 - val_loss: 1.0236 Epoch 67/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 0.9980 - val_loss: 1.0236 Epoch 68/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 367us/step - loss: 0.9887 - val_loss: 1.0236 Epoch 69/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0057 - val_loss: 1.0236 Epoch 70/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0071 - val_loss: 1.0236 Epoch 71/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 0.9925 - val_loss: 1.0236 Epoch 72/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 0.9862 - val_loss: 1.0236 Epoch 73/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 425us/step - loss: 0.9981 - val_loss: 1.0236 Epoch 74/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 362us/step - loss: 1.0103 - val_loss: 1.0236 Epoch 75/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 389us/step - loss: 0.9842 - val_loss: 1.0236 Epoch 76/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 1.0198 - val_loss: 1.0236 Epoch 77/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 410us/step - loss: 0.9958 - val_loss: 1.0236 Epoch 78/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 363us/step - loss: 0.9944 - val_loss: 1.0236 Epoch 79/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 0.9936 - val_loss: 1.0236 Epoch 80/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 361us/step - loss: 0.9860 - val_loss: 1.0236 Epoch 81/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 0.9970 - val_loss: 1.0236 Epoch 82/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 360us/step - loss: 1.0146 - val_loss: 1.0236 Epoch 83/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 354us/step - loss: 0.9852 - val_loss: 1.0236 Epoch 84/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 356us/step - loss: 0.9896 - val_loss: 1.0236 Epoch 85/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 355us/step - loss: 1.0000 - val_loss: 1.0236 Epoch 86/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 1.0012 - val_loss: 1.0236 Epoch 87/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 357us/step - loss: 1.0114 - val_loss: 1.0236 Epoch 88/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 358us/step - loss: 0.9845 - val_loss: 1.0236 Epoch 89/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 367us/step - loss: 1.0198 - val_loss: 1.0236 Epoch 90/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 0.9945 - val_loss: 1.0236 Epoch 91/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 353us/step - loss: 0.9843 - val_loss: 1.0236 Epoch 92/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 358us/step - loss: 1.0006 - val_loss: 1.0236 Epoch 93/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 352us/step - loss: 0.9787 - val_loss: 1.0236 Epoch 94/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 350us/step - loss: 0.9935 - val_loss: 1.0236 Epoch 95/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 359us/step - loss: 0.9970 - val_loss: 1.0236 Epoch 96/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 714us/step - loss: 1.0008 - val_loss: 1.0236 Epoch 97/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 373us/step - loss: 0.9962 - val_loss: 1.0236 Epoch 98/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 461us/step - loss: 1.0045 - val_loss: 1.0236 Epoch 99/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 501us/step - loss: 0.9937 - val_loss: 1.0236 Epoch 100/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 376us/step - loss: 1.0152 - val_loss: 1.0236 4077/4077 ━━━━━━━━━━━━━━━━━━━━ 3s 652us/step
AutoEncoder(batch_size=32, contamination=0.05, dropout_rate=0.2, epochs=100,
hidden_activation='relu', hidden_neurons=[25, 10, 2, 10, 25],
l2_regularizer=0.1,
loss=<function mean_squared_error at 0x16e35dbc0>, optimizer='adam',
output_activation='sigmoid', preprocessing=True, random_state=None,
validation_size=0.1, verbose=1)
# Get y train scores
y_train_scores2 = clf2.decision_scores_
threshold2 = clf2.threshold_
# get the prediction on the test data
y_test_pred2 = clf2.predict(X_test) # outlier labels (0 or 1)
# Predict anomaly score of X using fitted detector
y_test_scores2 = clf2.decision_function(X_test) # outlier scores
y_test_pred2 = pd.Series(y_test_pred2)
y_test_scores2 = pd.Series(y_test_scores2)
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 283us/step 1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 171us/step
print(y_test_scores2.describe(), '\n')
print("Threshold for Contamination Rate:", clf2.threshold_, '\n')
clf2.get_params()
count 32613.000000 mean 2.257347 std 2.214619 min 0.035605 25% 0.990695 50% 1.493880 75% 2.679262 max 55.760895 dtype: float64 Threshold for Contamination Rate: 6.585046138271961
{'batch_size': 32,
'contamination': 0.05,
'dropout_rate': 0.2,
'epochs': 100,
'hidden_activation': 'relu',
'hidden_neurons': [25, 10, 2, 10, 25],
'l2_regularizer': 0.1,
'loss': <function keras.src.losses.losses.mean_squared_error(y_true, y_pred)>,
'optimizer': 'adam',
'output_activation': 'sigmoid',
'preprocessing': True,
'random_state': None,
'validation_size': 0.1,
'verbose': 1}
plot_histogram_with_threshold(np.array(y_test_scores2), "Anomaly Scores for Clf2 Model", threshold2)
Model 2 Stats¶
model_2 = descriptive_stat_threshold(X_test,y_test_scores2, threshold2)
model_2
| Group | Count | Count % | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 30997 | 95.044921 | -0.0703 | -0.076 | -0.0625 | -0.0079 | -0.0930 | -0.0741 | -0.0113 | -0.0572 | -0.0390 | -0.0040 | 1.8847 |
| 1 | Outlier | 1616 | 4.955079 | 1.3489 | 1.458 | 1.1988 | 0.1511 | 1.7835 | 1.4205 | 0.2177 | 1.0980 | 0.7479 | 0.0776 | 9.4060 |
4.3 Model 3 ¶
clf3 = AutoEncoder(contamination=0.05, hidden_neurons =[25, 15, 10, 2, 10, 15, 25])
clf3.fit(X_train)
Model: "sequential_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ dense_15 (Dense) │ (None, 10) │ 110 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_13 (Dropout) │ (None, 10) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_16 (Dense) │ (None, 10) │ 110 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_14 (Dropout) │ (None, 10) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_17 (Dense) │ (None, 25) │ 275 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_15 (Dropout) │ (None, 25) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_18 (Dense) │ (None, 15) │ 390 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_16 (Dropout) │ (None, 15) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_19 (Dense) │ (None, 10) │ 160 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_17 (Dropout) │ (None, 10) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_20 (Dense) │ (None, 2) │ 22 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_18 (Dropout) │ (None, 2) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_21 (Dense) │ (None, 10) │ 30 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_19 (Dropout) │ (None, 10) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_22 (Dense) │ (None, 15) │ 165 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_20 (Dropout) │ (None, 15) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_23 (Dense) │ (None, 25) │ 400 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dropout_21 (Dropout) │ (None, 25) │ 0 │ ├─────────────────────────────────┼───────────────────────────┼────────────┤ │ dense_24 (Dense) │ (None, 10) │ 260 │ └─────────────────────────────────┴───────────────────────────┴────────────┘
Total params: 1,922 (7.51 KB)
Trainable params: 1,922 (7.51 KB)
Non-trainable params: 0 (0.00 B)
None Epoch 1/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 422us/step - loss: 8.7458 - val_loss: 1.3073 Epoch 2/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 404us/step - loss: 1.2296 - val_loss: 1.0455 Epoch 3/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 463us/step - loss: 1.0214 - val_loss: 1.0036 Epoch 4/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 412us/step - loss: 1.0065 - val_loss: 0.9988 Epoch 5/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 0.9820 - val_loss: 0.9981 Epoch 6/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 408us/step - loss: 0.9898 - val_loss: 0.9980 Epoch 7/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 406us/step - loss: 1.0113 - val_loss: 0.9980 Epoch 8/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 402us/step - loss: 0.9922 - val_loss: 0.9980 Epoch 9/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 0.9925 - val_loss: 0.9980 Epoch 10/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 535us/step - loss: 0.9955 - val_loss: 0.9979 Epoch 11/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 425us/step - loss: 0.9814 - val_loss: 0.9979 Epoch 12/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 418us/step - loss: 0.9953 - val_loss: 0.9979 Epoch 13/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 415us/step - loss: 0.9804 - val_loss: 0.9979 Epoch 14/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 1.0039 - val_loss: 0.9979 Epoch 15/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 1.0154 - val_loss: 0.9979 Epoch 16/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 410us/step - loss: 0.9973 - val_loss: 0.9979 Epoch 17/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 1.0224 - val_loss: 0.9979 Epoch 18/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 1.0069 - val_loss: 0.9979 Epoch 19/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 0.9886 - val_loss: 0.9979 Epoch 20/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 403us/step - loss: 1.0006 - val_loss: 0.9979 Epoch 21/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 401us/step - loss: 0.9965 - val_loss: 0.9979 Epoch 22/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 410us/step - loss: 1.0030 - val_loss: 0.9979 Epoch 23/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 475us/step - loss: 1.0041 - val_loss: 0.9979 Epoch 24/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 421us/step - loss: 0.9931 - val_loss: 0.9979 Epoch 25/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 527us/step - loss: 1.0115 - val_loss: 0.9979 Epoch 26/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 428us/step - loss: 1.0209 - val_loss: 0.9979 Epoch 27/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 0.9817 - val_loss: 0.9979 Epoch 28/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 1.0061 - val_loss: 0.9979 Epoch 29/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 0.9922 - val_loss: 0.9979 Epoch 30/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 404us/step - loss: 0.9945 - val_loss: 0.9979 Epoch 31/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 1.0085 - val_loss: 0.9979 Epoch 32/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 439us/step - loss: 1.0008 - val_loss: 0.9979 Epoch 33/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 418us/step - loss: 1.0013 - val_loss: 0.9979 Epoch 34/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 413us/step - loss: 1.0056 - val_loss: 0.9979 Epoch 35/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 408us/step - loss: 1.0070 - val_loss: 0.9979 Epoch 36/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 1.0053 - val_loss: 0.9979 Epoch 37/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 1.0043 - val_loss: 0.9979 Epoch 38/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 0.9988 - val_loss: 0.9979 Epoch 39/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 405us/step - loss: 1.0040 - val_loss: 0.9979 Epoch 40/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 402us/step - loss: 1.0058 - val_loss: 0.9979 Epoch 41/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 404us/step - loss: 0.9919 - val_loss: 0.9979 Epoch 42/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 413us/step - loss: 1.0083 - val_loss: 0.9979 Epoch 43/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 612us/step - loss: 1.0175 - val_loss: 0.9979 Epoch 44/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 682us/step - loss: 0.9834 - val_loss: 0.9979 Epoch 45/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 452us/step - loss: 0.9951 - val_loss: 0.9979 Epoch 46/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 458us/step - loss: 0.9848 - val_loss: 0.9979 Epoch 47/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 421us/step - loss: 1.0215 - val_loss: 0.9979 Epoch 48/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 506us/step - loss: 1.0114 - val_loss: 0.9979 Epoch 49/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 407us/step - loss: 1.0051 - val_loss: 0.9979 Epoch 50/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 403us/step - loss: 1.0061 - val_loss: 0.9979 Epoch 51/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 408us/step - loss: 1.0102 - val_loss: 0.9979 Epoch 52/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 1.0008 - val_loss: 0.9979 Epoch 53/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 399us/step - loss: 1.0064 - val_loss: 0.9979 Epoch 54/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 437us/step - loss: 0.9945 - val_loss: 0.9979 Epoch 55/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 398us/step - loss: 0.9881 - val_loss: 0.9979 Epoch 56/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 1.0059 - val_loss: 0.9979 Epoch 57/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 1.0082 - val_loss: 0.9979 Epoch 58/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 399us/step - loss: 1.0085 - val_loss: 0.9979 Epoch 59/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 398us/step - loss: 0.9843 - val_loss: 0.9979 Epoch 60/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 1.0111 - val_loss: 0.9979 Epoch 61/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 399us/step - loss: 1.0046 - val_loss: 0.9979 Epoch 62/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 408us/step - loss: 1.0008 - val_loss: 0.9979 Epoch 63/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 399us/step - loss: 0.9842 - val_loss: 0.9979 Epoch 64/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 404us/step - loss: 0.9973 - val_loss: 0.9979 Epoch 65/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 400us/step - loss: 0.9965 - val_loss: 0.9979 Epoch 66/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 717us/step - loss: 0.9918 - val_loss: 0.9979 Epoch 67/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 429us/step - loss: 1.0023 - val_loss: 0.9979 Epoch 68/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 413us/step - loss: 0.9989 - val_loss: 0.9979 Epoch 69/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 461us/step - loss: 0.9998 - val_loss: 0.9979 Epoch 70/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 3s 910us/step - loss: 0.9970 - val_loss: 0.9979 Epoch 71/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 565us/step - loss: 1.0104 - val_loss: 0.9979 Epoch 72/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 478us/step - loss: 0.9850 - val_loss: 0.9979 Epoch 73/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 434us/step - loss: 1.0053 - val_loss: 0.9979 Epoch 74/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 436us/step - loss: 0.9883 - val_loss: 0.9979 Epoch 75/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 461us/step - loss: 1.0017 - val_loss: 0.9979 Epoch 76/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 447us/step - loss: 1.0210 - val_loss: 0.9979 Epoch 77/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 437us/step - loss: 1.0006 - val_loss: 0.9979 Epoch 78/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 447us/step - loss: 1.0070 - val_loss: 0.9979 Epoch 79/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 1s 401us/step - loss: 1.0038 - val_loss: 0.9979 Epoch 80/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 464us/step - loss: 0.9899 - val_loss: 0.9979 Epoch 81/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 488us/step - loss: 1.0153 - val_loss: 0.9979 Epoch 82/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 453us/step - loss: 1.0067 - val_loss: 0.9979 Epoch 83/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 436us/step - loss: 1.0045 - val_loss: 0.9979 Epoch 84/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 540us/step - loss: 1.0186 - val_loss: 0.9979 Epoch 85/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 461us/step - loss: 0.9907 - val_loss: 0.9979 Epoch 86/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 497us/step - loss: 1.0027 - val_loss: 0.9979 Epoch 87/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 462us/step - loss: 0.9925 - val_loss: 0.9979 Epoch 88/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 418us/step - loss: 1.0269 - val_loss: 0.9979 Epoch 89/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 450us/step - loss: 1.0073 - val_loss: 0.9979 Epoch 90/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 442us/step - loss: 0.9823 - val_loss: 0.9979 Epoch 91/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 427us/step - loss: 1.0087 - val_loss: 0.9979 Epoch 92/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 430us/step - loss: 1.0061 - val_loss: 0.9979 Epoch 93/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 445us/step - loss: 0.9937 - val_loss: 0.9979 Epoch 94/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 437us/step - loss: 1.0031 - val_loss: 0.9979 Epoch 95/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 445us/step - loss: 1.0143 - val_loss: 0.9979 Epoch 96/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 427us/step - loss: 1.0064 - val_loss: 0.9979 Epoch 97/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 424us/step - loss: 0.9956 - val_loss: 0.9979 Epoch 98/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 425us/step - loss: 0.9956 - val_loss: 0.9979 Epoch 99/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 419us/step - loss: 0.9847 - val_loss: 0.9979 Epoch 100/100 3669/3669 ━━━━━━━━━━━━━━━━━━━━ 2s 421us/step - loss: 1.0009 - val_loss: 0.9979 4077/4077 ━━━━━━━━━━━━━━━━━━━━ 1s 182us/step
AutoEncoder(batch_size=32, contamination=0.05, dropout_rate=0.2, epochs=100,
hidden_activation='relu', hidden_neurons=[25, 15, 10, 2, 10, 15, 25],
l2_regularizer=0.1,
loss=<function mean_squared_error at 0x16e35dbc0>, optimizer='adam',
output_activation='sigmoid', preprocessing=True, random_state=None,
validation_size=0.1, verbose=1)
# Get y train scores
y_train_scores3 = clf3.decision_scores_
threshold3 = clf3.threshold_
# get the prediction on the test data
y_test_pred3 = clf3.predict(X_test) # outlier labels (0 or 1)
# Predict anomaly score of X using fitted detector
y_test_scores3 = clf3.decision_function(X_test) # outlier scores
y_test_pred3 = pd.Series(y_test_pred3)
y_test_scores3 = pd.Series(y_test_scores3)
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 218us/step 1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 166us/step
print(y_test_scores3.describe(), '\n')
print("Threshold for Contamination Rate:", clf3.threshold_, '\n')
clf3.get_params()
count 32613.000000 mean 2.257346 std 2.214620 min 0.035614 25% 0.990717 50% 1.493887 75% 2.679276 max 55.760933 dtype: float64 Threshold for Contamination Rate: 6.585096522818382
{'batch_size': 32,
'contamination': 0.05,
'dropout_rate': 0.2,
'epochs': 100,
'hidden_activation': 'relu',
'hidden_neurons': [25, 15, 10, 2, 10, 15, 25],
'l2_regularizer': 0.1,
'loss': <function keras.src.losses.losses.mean_squared_error(y_true, y_pred)>,
'optimizer': 'adam',
'output_activation': 'sigmoid',
'preprocessing': True,
'random_state': None,
'validation_size': 0.1,
'verbose': 1}
plot_histogram_with_threshold(np.array(y_test_scores3), "Anomaly Scores for Clf3 Model", clf3.threshold_)
Model 3 Stats¶
model_3 = descriptive_stat_threshold(X_test,y_test_scores3, threshold3)
model_3
| Group | Count | Count % | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 30997 | 95.044921 | -0.0703 | -0.076 | -0.0625 | -0.0079 | -0.0930 | -0.0741 | -0.0113 | -0.0572 | -0.0390 | -0.0040 | 1.8847 |
| 1 | Outlier | 1616 | 4.955079 | 1.3489 | 1.458 | 1.1988 | 0.1511 | 1.7835 | 1.4205 | 0.2177 | 1.0980 | 0.7479 | 0.0776 | 9.4060 |
4.4 Aggregate Multiple Models ¶
# Put all the predictions in a data frame
train_scores = pd.DataFrame({'clf1': clf1.decision_scores_,
'clf2': clf2.decision_scores_,
'clf3': clf3.decision_scores_
})
test_scores = pd.DataFrame({'clf1': clf1.decision_function(X_test),
'clf2': clf2.decision_function(X_test),
'clf3': clf3.decision_function(X_test)
})
1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 397us/step 1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 180us/step 1020/1020 ━━━━━━━━━━━━━━━━━━━━ 0s 166us/step
train_scores.head()
| clf1 | clf2 | clf3 | |
|---|---|---|---|
| 0 | 0.816857 | 0.816857 | 0.816852 |
| 1 | 4.866587 | 4.866563 | 4.866580 |
| 2 | 3.215261 | 3.215289 | 3.215277 |
| 3 | 2.800463 | 2.800432 | 2.800467 |
| 4 | 1.299050 | 1.299050 | 1.299045 |
test_scores.head()
| clf1 | clf2 | clf3 | |
|---|---|---|---|
| 0 | 1.261936 | 1.261924 | 1.261927 |
| 1 | 1.462396 | 1.462378 | 1.462387 |
| 2 | 0.892089 | 0.892093 | 0.892095 |
| 3 | 0.953255 | 0.953249 | 0.953248 |
| 4 | 0.921061 | 0.921044 | 0.921054 |
4.5 Combination by Average ¶
# Standardize the decision scores
train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)
# Combination by average
y_train_by_average = average(train_scores_norm)
y_test_by_average = average(test_scores_norm)
y_test_by_average[1:10]
array([-0.35827573, -0.61566098, -0.588059 , -0.60259031, -0.77464724,
-0.57074008, -0.53622878, 0.75240693, -0.16169526])
contamination = 0.05
threshold_index = int((1 - contamination) * len(y_train_by_average))
threshold_avg = np.sort(y_train_by_average)[threshold_index]
threshold_avg
1.9541912732491085
# Function to plot histograms with threshold annotation and color differentiation
def plot_histogram_with_threshold_new(scores, title, threshold):
fig = go.Figure()
# Split the scores into two parts: below the threshold and above/equal to the threshold
below_threshold = scores[scores < threshold]
above_threshold = scores[scores >= threshold]
# Add histogram traces for the two parts
fig.add_trace(go.Histogram(x=below_threshold, name='Normal', marker=dict(color='navy')))
fig.add_trace(go.Histogram(x=above_threshold, name='Outliers', marker=dict(color='Magenta')))
# Add vertical line annotating the threshold
fig.add_vline(x=threshold, line_dash="dash", line_color="#0d5087", annotation_text="Threshold",
annotation_position="top right", annotation=dict(font=dict(color="#27496d", size=12)))
# Update layout
fig.update_layout(
title=title,
xaxis_title="Anomaly Score",
yaxis_title="Frequency",
showlegend=True,
height=450,
width=800,
paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)'
)
fig.update_xaxes(range=[-0.1, 10])
# Show the plot
fig.show()
plot_histogram_with_threshold_new(np.array(y_train_by_average), "Anomaly Scores for Combination by Average", threshold_avg)
Combination by Avg Stats¶
auto_avg=descriptive_stat_threshold(X_test,y_test_by_average, threshold_avg)
auto_avg
| Group | Count | Count % | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 30997 | 95.044921 | -0.0703 | -0.076 | -0.0625 | -0.0079 | -0.0930 | -0.0741 | -0.0113 | -0.0572 | -0.0390 | -0.0040 | -0.1677 |
| 1 | Outlier | 1616 | 4.955079 | 1.3489 | 1.458 | 1.1988 | 0.1511 | 1.7835 | 1.4205 | 0.2177 | 1.0980 | 0.7479 | 0.0776 | 3.2268 |
4.5.1 Combination by AVG Outlier List ¶
# Add a column 'Prediction_Comb_Avg' to final_df and initialize with 0 (not outlier)
final_df['Prediction_Comb_Avg'] = 0
# Mark outliers in training data
final_df.loc[X_train.index[y_train_by_average >= threshold_avg], 'Prediction_Comb_Avg'] = 1
# Mark outliers in test data
final_df.loc[X_test.index[y_test_by_average >= threshold_avg], 'Prediction_Comb_Avg'] = 1
final_df.head()
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Prediction_Comb_Avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | 32963.07 | ... | 1.035685 | 1.038763 | 1.0 | 1.005855 | 1.037810 | 1.0 | 0.885921 | 1.033356 | 1.0 | 0 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | 15131.85 | ... | 1.081989 | 1.000000 | 1.0 | 1.007653 | 1.000000 | 1.0 | 0.708782 | 1.000000 | 1.0 | 0 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | 37560.37 | ... | 0.968301 | 1.000000 | 1.0 | 0.946260 | 1.000000 | 1.0 | 0.857643 | 1.000000 | 1.0 | 0 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | 13998.28 | ... | 0.897723 | 0.899200 | 1.0 | 0.943232 | 0.915208 | 1.0 | 1.126205 | 0.970586 | 1.0 | 0 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | 31633.27 | ... | 1.054754 | 1.000000 | 1.0 | 0.985152 | 1.000000 | 1.0 | 0.705312 | 1.000000 | 1.0 | 0 |
5 rows × 24 columns
# Filter out outliers
outliers_comb = final_df[final_df['Prediction_Comb_Avg'] == 1]
# Display the list of outliers
outliers_list_comb = outliers_comb[['DRG', 'Provider_Id', 'Provider_Name','Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns + ['Prediction_Comb_Avg']]
print("List of Outliers:")
outliers_list_comb.head(5)
List of Outliers:
| DRG | Provider_Id | Provider_Name | Provider_State | Provider_City | Provider_Zipcode | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Prediction_Comb_Avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10046 | RIVERVIEW REGIONAL MEDICAL CENTER | AL | GADSDEN | 35901 | 0.808802 | 0.976949 | 1.012007 | 1.0 | 0.950895 | 0.992776 | 1.0 | 0.846140 | 0.912300 | 1.0 | 1 |
| 14 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10083 | SOUTH BALDWIN REGIONAL MEDICAL CENTER | AL | FOLEY | 36535 | 0.789031 | 0.953069 | 1.000000 | 1.0 | 0.919792 | 1.000000 | 1.0 | 0.786001 | 1.000000 | 1.0 | 1 |
| 34 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 30038 | SCOTTSDALE HEALTHCARE-OSBORN MEDICAL CENTER | AZ | SCOTTSDALE | 85251 | 0.985090 | 0.925614 | 1.028059 | 1.0 | 0.906303 | 1.042642 | 1.0 | 0.822265 | 1.120502 | 1.0 | 1 |
| 43 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 30094 | ARROWHEAD HOSPITAL | AZ | GLENDALE | 85308 | 1.127600 | 1.059519 | 1.033826 | 1.0 | 0.968981 | 1.009958 | 1.0 | 0.574981 | 0.852184 | 1.0 | 1 |
| 53 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 40026 | ST JOSEPHS MERCY HEALTH CENTER INC | AR | HOT SPRINGS | 71903 | 0.856703 | 0.963753 | 0.716093 | 1.0 | 0.948941 | 0.742293 | 1.0 | 0.886955 | 0.890442 | 1.0 | 1 |
print(f' The Combination by AVG Outlier list contains {len(outliers_list_comb.groupby("Provider_Name").count())} providers who performed {len(outliers_list_comb.groupby("DRG").count())} procedures accross {len(outliers_list_comb.groupby("Provider_State").count())} states', '\n')
The Combination by AVG Outlier list contains 2470 providers who performed 100 procedures accross 51 states
4.5.2 Top 10 Outlier by DRG ¶
outliers_list_comb.groupby('DRG').agg(count = pd.NamedAgg(column = 'DRG', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
| count | |
|---|---|
| DRG | |
| 191 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W CC | 282 |
| 192 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W/O CC/MCC | 278 |
| 193 - SIMPLE PNEUMONIA & PLEURISY W MCC | 252 |
| 194 - SIMPLE PNEUMONIA & PLEURISY W CC | 212 |
| 189 - PULMONARY EDEMA & RESPIRATORY FAILURE | 199 |
| 177 - RESPIRATORY INFECTIONS & INFLAMMATIONS W MCC | 193 |
| 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC | 192 |
| 069 - TRANSIENT ISCHEMIA | 190 |
| 178 - RESPIRATORY INFECTIONS & INFLAMMATIONS W CC | 180 |
| 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC | 178 |
4.5.3 Outlier by State ¶
outliers_list_comb.groupby('Provider_State').agg(count = pd.NamedAgg(column = 'Provider_State', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
| count | |
|---|---|
| Provider_State | |
| CA | 674 |
| TX | 588 |
| FL | 546 |
| NY | 429 |
| IL | 390 |
| PA | 350 |
| OH | 336 |
| NC | 282 |
| GA | 262 |
| MI | 250 |
| NJ | 245 |
| MO | 221 |
| VA | 221 |
| TN | 200 |
| IN | 200 |
| MD | 176 |
| AL | 176 |
| KY | 175 |
| MA | 165 |
| SC | 162 |
| LA | 140 |
| AZ | 138 |
| WI | 135 |
| MS | 125 |
| OK | 120 |
| WA | 113 |
| AR | 110 |
| IA | 102 |
| MN | 102 |
| CT | 101 |
| KS | 84 |
| CO | 82 |
| WV | 78 |
| OR | 57 |
| NH | 56 |
| NE | 53 |
| NV | 52 |
| ME | 48 |
| NM | 44 |
| UT | 30 |
| ID | 29 |
| RI | 28 |
| SD | 27 |
| HI | 26 |
| MT | 24 |
| ND | 21 |
| DC | 20 |
| AK | 18 |
| VT | 12 |
| DE | 12 |
| WY | 10 |
4.6 Combination by Max ¶
# Combination by max
y_by_maximization = maximization(test_scores_norm)
y_by_maximization
array([-0.44874417, -0.35827236, -0.61565891, ..., -0.71297992,
-0.2599202 , -0.37496343])
plot_histogram_with_threshold_new(np.array(y_by_maximization), "Anomaly Scores for Combination by Max", threshold_avg)
Combination by Max Stats¶
auto_max = descriptive_stat_threshold(X_test,y_by_maximization, threshold_avg)
auto_max
| Group | Count | Count % | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 30997 | 95.044921 | -0.0703 | -0.076 | -0.0625 | -0.0079 | -0.0930 | -0.0741 | -0.0113 | -0.0572 | -0.0390 | -0.0040 | -0.1677 |
| 1 | Outlier | 1616 | 4.955079 | 1.3489 | 1.458 | 1.1988 | 0.1511 | 1.7835 | 1.4205 | 0.2177 | 1.0980 | 0.7479 | 0.0776 | 3.2268 |
Key Takeways
- All 3 models were tested using different range of hidden layers and ecoding-decoding the data
- To validate the scores, all models were combined by taking the average and max scores to reduce the chance of overfitting or underfitting the data
- As shown on the stats table above, both average and max models were able to captured 1,616 outliers with a 3.23 anomaly score compared to -0.17 anomaly score of the normal group, which accounted for 0.5% of the data
Significant Features for identifying outliers based on the provided data are:
- Avg Med by DRG
- Avg Med by State
- Avg Med by State, City
- Avg Total by State
- Avg Total by State, City
These features show the largest differences between the normal and outlier groups, indicating they are strong indicators of anomalous behavior in the dataset
Isolation Forest is an ensemble-based anomaly detection method that isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. It is based on the fact that anomalies are few and different, thus they are more susceptible to isolation.
Use Cases:
- Anomaly Detection
- Preprocessing: Identify and removing anomalies before further data processing or modeling
isft = IForest(contamination=0.05, max_samples=40, behaviour='new')
isft.fit(X_train)
# Training data
y_train_scores_isft = isft.decision_function(X_train)
y_train_pred_isft = isft.predict(X_train)
# Test data
y_test_scores_isft = isft.decision_function(X_test)
y_test_pred_isft = isft.predict(X_test) # outlier labels (0 or 1)
# Threshold for the defined comtanimation rate
forest_threshold = isft.threshold_
print("The threshold for the defined contamination rate:" , isft.threshold_)
The threshold for the defined contamination rate: -4.8620114688870467e-17
isft.get_params()
{'behaviour': 'new',
'bootstrap': False,
'contamination': 0.05,
'max_features': 1.0,
'max_samples': 40,
'n_estimators': 100,
'n_jobs': 1,
'random_state': None,
'verbose': 0}
isft_vi = isft.feature_importances_
isft_vi
array([0.13968889, 0.12664838, 0.10514943, 0.05940842, 0.11974233,
0.11893943, 0.04756913, 0.11662462, 0.11054464, 0.05568474])
Fearture Importance¶
# Define the feature names corresponding to 0-9
feature_names = [
'Average_Medicare_byDRG_Ratio',
'Average_Medicare_Bystates_Ratio',
'Average_Medicare_Bystatescity_Ratio',
'Average_Medicare_Byzip_Ratio',
'Average_Total_Bystates_Ratio',
'Average_Total_Bystatescity_Ratio',
'Average_Total_Byzip_Ratio',
'Average_OOP_Bystates_Ratio',
'Average_OOP_Bystatescity_Ratio',
'Average_OOP_Byzip_Ratio'
]
# Create a DataFrame for plotting
for_plot = pd.DataFrame({'x_axis': feature_names, 'y_axis': isft_vi}).sort_values(by='y_axis', ascending=True)
# Create a horizontal bar plot using Plotly Express
fig = px.bar(for_plot, x='y_axis', y='x_axis', orientation='h', labels={'x_axis': 'Features', 'y_axis': 'Importance'},
title='Feature Importances')
# Update layout for better visualization
fig.update_traces(marker_color='navy')
fig.update_layout(template='plotly_white', paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)')
# Show the plot
fig.show()
iForest Stats¶
forest = descriptive_stat_threshold(X_test,y_test_scores_isft, forest_threshold)
forest
| Group | Count | Count % | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 31023 | 95.124644 | -0.0899 | -0.1045 | -0.1027 | -0.0338 | -0.1119 | -0.1106 | -0.0408 | -0.0336 | -0.0289 | -0.0078 | -0.1998 |
| 1 | Outlier | 1590 | 4.875356 | 1.7541 | 2.0381 | 2.0043 | 0.6594 | 2.1835 | 2.1571 | 0.7968 | 0.6551 | 0.5637 | 0.1516 | 0.0449 |
5.2 Combination by Average ¶
# Test a range of maximum samples
k_list = [20, 30, 40, 50, 60]
k_list = [100, 200, 300, 400, 500]
n_clf = len(k_list)
# Just prepare data frames so we can store the model results
train_scores = np.zeros([X_train.shape[0], n_clf])
test_scores = np.zeros([X_test.shape[0], n_clf])
# Modeling
for i in range(n_clf):
k = k_list[i]
#isft = IForest(contamination=0.05, max_samples=k)
isft = IForest(contamination=0.05, n_estimators=k)
isft.fit(X_train)
# Store the results in each column:
train_scores[:, i] = isft.decision_function(X_train)
test_scores[:, i] = isft.decision_function(X_test)
# Decision scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)
y_train_by_average_isft = average(train_scores_norm)
y_test_by_average_isft = average(test_scores_norm)
contamination = 0.05
threshold_index = int((1 - contamination) * len(y_train_by_average_isft))
threshold_forest_avg = np.sort(y_train_by_average_isft)[threshold_index]
threshold_forest_avg
2.263033002370265
plot_histogram_with_threshold_new(np.array(y_test_by_average_isft), "Anomaly Scores for iForest Combination by Avg", threshold_forest_avg)
iForest Combination by Avg Stats¶
forest_avg = descriptive_stat_threshold(X_test,y_test_by_average_isft, threshold_forest_avg)
forest_avg
| Group | Count | Count % | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 31029 | 95.143041 | -0.0721 | -0.0762 | -0.0580 | -0.0078 | -0.0892 | -0.0709 | -0.0136 | -0.0419 | -0.0358 | -0.0090 | -0.1454 |
| 1 | Outlier | 1584 | 4.856959 | 1.4120 | 1.4934 | 1.1361 | 0.1523 | 1.7481 | 1.3893 | 0.2663 | 0.8209 | 0.7006 | 0.1763 | 3.0876 |
5.2.1 Combination by AVG Outlier List ¶
# Add a column 'Prediction_Comb_Avg' to final_df and initialize with 0 (not outlier)
final_df['Prediction_Forest_Avg'] = 0
# Mark outliers in training data
final_df.loc[X_train.index[y_train_by_average_isft >= threshold_avg], 'Prediction_Forest_Avg'] = 1
# Mark outliers in test data
final_df.loc[X_test.index[y_test_by_average_isft >= threshold_avg], 'Prediction_Forest_Avg'] = 1
final_df.head()
| DRG | Provider_Id | Provider_Name | Provider_StreetAddress | Provider_City | Provider_State | Provider_Zipcode | Hospital_referral_region_desp | Total_Discharges | Average_Covered_Charges | ... | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Prediction_Comb_Avg | Prediction_Forest_Avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10001 | SOUTHEAST ALABAMA MEDICAL CENTER | 1108 ROSS CLARK CIRCLE | DOTHAN | AL | 36301 | AL - Dothan | 91 | 32963.07 | ... | 1.038763 | 1.0 | 1.005855 | 1.037810 | 1.0 | 0.885921 | 1.033356 | 1.0 | 0 | 0 |
| 1 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10005 | MARSHALL MEDICAL CENTER SOUTH | 2505 U S HIGHWAY 431 NORTH | BOAZ | AL | 35957 | AL - Birmingham | 14 | 15131.85 | ... | 1.000000 | 1.0 | 1.007653 | 1.000000 | 1.0 | 0.708782 | 1.000000 | 1.0 | 0 | 0 |
| 2 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10006 | ELIZA COFFEE MEMORIAL HOSPITAL | 205 MARENGO STREET | FLORENCE | AL | 35631 | AL - Birmingham | 24 | 37560.37 | ... | 1.000000 | 1.0 | 0.946260 | 1.000000 | 1.0 | 0.857643 | 1.000000 | 1.0 | 0 | 0 |
| 3 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10011 | ST VINCENT'S EAST | 50 MEDICAL PARK EAST DRIVE | BIRMINGHAM | AL | 35235 | AL - Birmingham | 25 | 13998.28 | ... | 0.899200 | 1.0 | 0.943232 | 0.915208 | 1.0 | 1.126205 | 0.970586 | 1.0 | 0 | 0 |
| 4 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10016 | SHELBY BAPTIST MEDICAL CENTER | 1000 FIRST STREET NORTH | ALABASTER | AL | 35007 | AL - Birmingham | 18 | 31633.27 | ... | 1.000000 | 1.0 | 0.985152 | 1.000000 | 1.0 | 0.705312 | 1.000000 | 1.0 | 0 | 0 |
5 rows × 25 columns
# Filter out outliers
outliers_forest = final_df[final_df['Prediction_Forest_Avg'] == 1]
# Display the list of outliers
outliers_list_forest = outliers_forest[['DRG', 'Provider_Id', 'Provider_Name','Provider_State', 'Provider_City', 'Provider_Zipcode'] + ratio_columns + ['Prediction_Forest_Avg']]
print("List of Outliers:")
outliers_list_forest.head(5)
List of Outliers:
| DRG | Provider_Id | Provider_Name | Provider_State | Provider_City | Provider_Zipcode | Average_Medicare_byDRG_Ratio | Average_Medicare_Bystates_Ratio | Average_Medicare_Bystatescity_Ratio | Average_Medicare_Byzip_Ratio | Average_Total_Bystates_Ratio | Average_Total_Bystatescity_Ratio | Average_Total_Byzip_Ratio | Average_OOP_Bystates_Ratio | Average_OOP_Bystatescity_Ratio | Average_OOP_Byzip_Ratio | Prediction_Forest_Avg | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10046 | RIVERVIEW REGIONAL MEDICAL CENTER | AL | GADSDEN | 35901 | 0.808802 | 0.976949 | 1.012007 | 1.0 | 0.950895 | 0.992776 | 1.0 | 0.846140 | 0.912300 | 1.0 | 1 |
| 12 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10056 | ST VINCENT'S BIRMINGHAM | AL | BIRMINGHAM | 35205 | 0.753445 | 0.910085 | 0.911582 | 1.0 | 0.935761 | 0.907959 | 1.0 | 1.038995 | 0.895427 | 1.0 | 1 |
| 14 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 10083 | SOUTH BALDWIN REGIONAL MEDICAL CENTER | AL | FOLEY | 36535 | 0.789031 | 0.953069 | 1.000000 | 1.0 | 0.919792 | 1.000000 | 1.0 | 0.786001 | 1.000000 | 1.0 | 1 |
| 34 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 30038 | SCOTTSDALE HEALTHCARE-OSBORN MEDICAL CENTER | AZ | SCOTTSDALE | 85251 | 0.985090 | 0.925614 | 1.028059 | 1.0 | 0.906303 | 1.042642 | 1.0 | 0.822265 | 1.120502 | 1.0 | 1 |
| 43 | 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC | 30094 | ARROWHEAD HOSPITAL | AZ | GLENDALE | 85308 | 1.127600 | 1.059519 | 1.033826 | 1.0 | 0.968981 | 1.009958 | 1.0 | 0.574981 | 0.852184 | 1.0 | 1 |
print(f' The Combination by iForest AVG Outlier list contains {len(outliers_list_forest.groupby("Provider_Name").count())} providers who performed {len(outliers_list_forest.groupby("DRG").count())} procedures accross {len(outliers_list_forest.groupby("Provider_State").count())} states', '\n')
The Combination by iForest AVG Outlier list contains 2616 providers who performed 100 procedures accross 51 states
5.2.2 Top 10 Outliers by DRG¶
outliers_list_forest.groupby('DRG').agg(count = pd.NamedAgg(column = 'DRG', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
| count | |
|---|---|
| DRG | |
| 191 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W CC | 360 |
| 190 - CHRONIC OBSTRUCTIVE PULMONARY DISEASE W MCC | 354 |
| 193 - SIMPLE PNEUMONIA & PLEURISY W MCC | 331 |
| 194 - SIMPLE PNEUMONIA & PLEURISY W CC | 265 |
| 065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC | 257 |
| 189 - PULMONARY EDEMA & RESPIRATORY FAILURE | 254 |
| 069 - TRANSIENT ISCHEMIA | 248 |
| 066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC | 243 |
| 178 - RESPIRATORY INFECTIONS & INFLAMMATIONS W CC | 241 |
| 177 - RESPIRATORY INFECTIONS & INFLAMMATIONS W MCC | 239 |
5.2.3 Outliers by State¶
outliers_list_forest.groupby('Provider_State').agg(count = pd.NamedAgg(column = 'Provider_State', aggfunc = 'count')).sort_values(by= 'count', ascending= False)[1:11]
| count | |
|---|---|
| Provider_State | |
| CA | 863 |
| TX | 734 |
| FL | 700 |
| NY | 568 |
| IL | 511 |
| PA | 480 |
| OH | 441 |
| NC | 358 |
| MI | 346 |
| GA | 333 |
| NJ | 313 |
| VA | 307 |
| MO | 275 |
| IN | 269 |
| TN | 249 |
| AL | 231 |
| MA | 230 |
| KY | 229 |
| MD | 225 |
| SC | 221 |
| AZ | 180 |
| WI | 179 |
| LA | 176 |
| MS | 156 |
| WA | 155 |
| OK | 155 |
| AR | 147 |
| MN | 139 |
| CT | 128 |
| IA | 120 |
| CO | 115 |
| KS | 103 |
| WV | 99 |
| NV | 79 |
| OR | 72 |
| NH | 70 |
| NE | 68 |
| ME | 63 |
| NM | 58 |
| UT | 44 |
| ID | 41 |
| RI | 38 |
| HI | 36 |
| SD | 34 |
| MT | 30 |
| ND | 28 |
| DC | 27 |
| AK | 19 |
| VT | 17 |
| DE | 15 |
| WY | 11 |
Key Takeways
- As shown on the stats table above, both average and max models were able to captured 1,583 outliers with a 3.06 anomaly score compared to -0.15 anomaly score of the normal group, which accounted for 0.5% of the data
Significant Features for identifying outliers based on the provided data are:
- Avg Med by DRG
- Avg Med by State
- Avg Med by State, City
- Avg Total by State
- Avg Total by State, City
- Avg OOP by State
These features show the largest differences between the normal and outlier groups, indicating they are strong indicators of anomalous behavior in the dataset
Section 6 Model Comparison ¶
# Put the actual, the HBO score and the ECOD score together
Actual_preds = pd.DataFrame({'AutoEcoder_Score': y_test_scores,
'iForest_Score': y_test_scores_isft})
Actual_preds['AutoEncoder_pred'] = np.where(Actual_preds['AutoEcoder_Score']>clf1.threshold_,1,0)
Actual_preds['iForest_pred'] = np.where(Actual_preds['iForest_Score']>isft.threshold_,1,0)
Actual_preds.head()
| AutoEcoder_Score | iForest_Score | AutoEncoder_pred | iForest_pred | |
|---|---|---|---|---|
| 0 | 1.261936 | -0.259086 | 0 | 0 |
| 1 | 1.462396 | -0.226800 | 0 | 0 |
| 2 | 0.892089 | -0.247386 | 0 | 0 |
| 3 | 0.953255 | -0.266837 | 0 | 0 |
| 4 | 0.921061 | -0.246449 | 0 | 0 |
# Compute confusion matrix
cm = pd.crosstab(Actual_preds['AutoEncoder_pred'],Actual_preds['iForest_pred'])
# Plot confusion matrix as heatmap
trace2 = go.Heatmap(z=cm,
x=['Predicted 0', 'Predicted 1'],
y=['True 0', 'True 1'],
showscale=False,
colorscale=[
[0.0, "navy"],
[0.005, "Magenta"],
[1.0, "#ffffff"],
],
xgap=20,
ygap=20,
text=cm,
texttemplate="%{text}")
# Define layout
layout = go.Layout(
title=dict(text="Confusion Matrix between AutoEcoder and iForest", x=0.5, y=0.9, xanchor='center', yanchor='top'),
xaxis=dict(title='iForest Pred', showticklabels=True),
yaxis=dict(title='AutoEcoder Pred', showticklabels=True),
autosize=False,
width=500,
height=500,
plot_bgcolor='rgba(0,0,0,0)',
paper_bgcolor='rgba(0,0,0,0)'
)
# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()
Key Takeaways
- Both AutoEcoder and iForest models were able to identify more than 30k instances as normal along with ~1.2k instances as outliers with some discrepancies between false negatives and false positive.